mati865/scx
Sched_ext Schedulers and Tools
sched_ext is a Linux kernel feature
which enables implementing kernel thread schedulers in BPF and dynamically
loading them. This repository contains various scheduler implementations and
support utilities.
sched_ext enables safe and rapid iterations of scheduler implementations, thus
radically widening the scope of scheduling strategies that can be experimented
with and deployed, even in massive and complex production environments.
You can find more information, links to blog posts and recordings, in the wiki.
The following are a few highlights of this repository.
- The scx_layered case study concretely demonstrates the power and benefits of sched_ext.
- For a high-level but thorough overview of sched_ext (especially its
motivation), please refer to the overview document.
- For a description of the schedulers shipped with this tree, please refer to
the schedulers document.
- The following video is the scx_rustland scheduler, which makes most
scheduling decisions in userspace Rust code, showing better FPS in Terraria
while the kernel is being compiled. This doesn't mean that scx_rustland is a
better scheduler, but it does demonstrate how safe and easy it is to
implement a scheduler which is generally usable and can outperform the default
scheduler in certain scenarios.
scx_rustland-terraria.mp4
sched_ext is supported by the upstream kernel starting from version 6.12. Both
Meta and Google are fully committed to sched_ext and Meta is in the process of
mass production deployment. See Kernel Feature Status for more details.
In all example shell commands, $SCX refers to the root of this repository.
Getting Started
All that's necessary for running sched_ext schedulers is a kernel with
sched_ext support and the scheduler binaries along with the libraries they
depend on. Switching to a sched_ext scheduler is as simple as running a
sched_ext binary:
root@test ~# cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
disabled
root@test ~# scx_simple
local=1 global=0
local=74 global=15
local=78 global=32
local=82 global=42
local=86 global=54
^Zfish: Job 1, 'scx_simple' has stopped
root@test ~# cat /sys/kernel/sched_ext/state /sys/kernel/sched_ext/*/ops 2>/dev/null
enabled
simple
root@test ~# fg
Send job 1 (scx_simple) to foreground
local=635 global=179
local=696 global=192
^CEXIT: BPF scheduler unregistered
scx_simple is a very simple global vtime scheduler which can behave acceptably on CPUs
with a simple topology (single socket and single L3 cache domain).
Above, we switch the whole system to use scx_simple by running the binary,
suspend it with ctrl-z to confirm that it's loaded, and then switch back
to the kernel default scheduler by terminating the process with ctrl-c.
For scx_simple, suspending the scheduler process doesn't affect scheduling
behavior because all that the userspace component does is print statistics.
This doesn't hold for all schedulers.
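The check used in the session above can be wrapped in a small helper. This is only a sketch: the function name scx_status and its optional directory argument are hypothetical and exist to make the function easy to try against a test directory; the sysfs paths are the ones shown in the session.

```shell
#!/bin/sh
# scx_status [DIR]: print the sched_ext state and the loaded scheduler, if any.
# DIR defaults to the sysfs location used in the session above; the override
# is a hypothetical convenience for testing.
scx_status() {
    dir="${1:-/sys/kernel/sched_ext}"
    if [ ! -r "$dir/state" ]; then
        echo "sched_ext not available"
        return 1
    fi
    echo "state: $(cat "$dir/state")"
    for ops in "$dir"/*/ops; do
        # The glob stays unexpanded when nothing matches, so check for a real file.
        [ -r "$ops" ] && echo "scheduler: $(cat "$ops")"
    done
    return 0
}
```

Calling scx_status with no argument reads the live sysfs files; on a kernel without sched_ext support it reports that the feature is not available and returns a non-zero status.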
In addition to terminating the program, there are two more ways to disable a
sched_ext scheduler - sysrq-S and the watchdog timer. Ignoring kernel
bugs, the worst damage a sched_ext scheduler can do to a system is starving
some threads until the watchdog timer triggers.
As illustrated, once the kernel and binaries are in place, using sched_ext
schedulers is straightforward and safe. While developing and building
schedulers in this repository isn't complicated either, sched_ext makes use
of many new BPF features, some of which require build tools which are newer
than what many distros are currently shipping. This should become less of an
issue in the future. For the time being, the following custom repositories
are provided for select distros.
Install Instructions by Distro
Repository Structure
scx
|-- scheds            : Sched_ext scheduler implementations
|   |-- include       : Shared BPF and user C include files including vmlinux.h
|   |-- c             : Example schedulers - userspace code written in C
|   \-- rust          : Example schedulers - userspace code written in Rust
\-- rust              : Rust support code
    \-- scx_utils     : Common utility library for Rust schedulers
Build & Install
meson is the main build system but each Rust sub-project is its own
self-contained cargo project and can be built and published separately. The
following are the dependencies and version requirements.
Note: Many distros only have earlier versions of meson. In that case, just clone the meson
repo and call meson.py, e.g. /path/to/meson/repo/meson.py compile -C build. Alternatively, use pip, e.g.
pip install meson or pip install meson --break-system-packages (if needed).
- meson: >=1.2, build scripts under meson-scripts/ use bash and
standard utilities including awk.
- clang: >=16 required, >=17 recommended
- libbpf: >=1.2.2 required, >=1.3 recommended (RESIZE_ARRAY support is
new in 1.3). It's preferred to link statically against the source from the libbpf git repo, which is cloned during setup.
- Rust toolchain: >=1.82
- libelf, libz, libzstd if linking against static libbpf.a
- bpftool: By default this is cloned and built as part of the default build process. Alternatively it's usually available in linux-tools-common.
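The minimum versions listed above can be compared against installed versions with a small helper. This is a sketch, not part of the repository: the function names version_ge and check_min are hypothetical, and the comparison relies on sort -V (version sort) from GNU coreutils.

```shell
#!/bin/sh
# version_ge HAVE WANT: succeed when version HAVE is at least WANT.
# Uses version-aware sorting: WANT must sort first (or be equal to HAVE).
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# check_min NAME HAVE WANT: report whether an installed version is new enough.
check_min() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
        return 1
    fi
}
```

For example, check_min meson "$(meson --version)" 1.2 would compare the installed meson against the requirement above. Note that sort -V treats 1.2 as older than 1.2.0, so the versions should be written with the same number of components when they might be equal.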
The kernel has to be built with the following configuration:
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_SCHED_CLASS_EXT=y
The scx/kernel.config file includes all required and other recommended options for using sched_ext.
You can append its contents to your kernel .config file to enable the necessary features.
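To confirm that a given kernel .config already enables the required options, a check along the following lines may help. It is a minimal sketch: the function name check_scx_config and the variable SCX_REQUIRED_OPTS are hypothetical, and the option list is copied from the configuration listed above.

```shell
#!/bin/sh
# Required options, taken from the kernel configuration list above.
SCX_REQUIRED_OPTS="CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_BPF_JIT
CONFIG_DEBUG_INFO_BTF CONFIG_BPF_JIT_ALWAYS_ON CONFIG_BPF_JIT_DEFAULT_ON
CONFIG_SCHED_CLASS_EXT"

# check_scx_config FILE: print every required option that is not set to "y"
# in FILE; return non-zero if any option is missing.
check_scx_config() {
    missing=0
    for opt in $SCX_REQUIRED_OPTS; do
        # Anchor the pattern so CONFIG_BPF does not match CONFIG_BPF_SYSCALL.
        if ! grep -q "^${opt}=y\$" "$1"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}
```

Run it as check_scx_config /path/to/kernel/.config; no output and a zero exit status mean all listed options are enabled.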
Setting Up and Building
meson always uses a separate build directory. Running the following
commands in the root of the tree builds and installs all schedulers under
~/bin.
Static linking against libbpf (preferred)
$ cd $SCX
$ meson setup build --prefix ~
$ meson compile -C build
$ meson install -C build
Notes: meson setup will also clone both libbpf and bpftool repos and meson compile will build them both.
Make sure you have dependencies installed that allow you to compile from source!
Ubuntu/Debian
apt install build-essential libssl-dev llvm lld libelf-dev meson cargo rustc clang llvm cmake pkg-config protobuf-compiler
Arch Linux
pacman -S base-devel
Static linking against system libbpf
Note, depending on your system configuration libbpf_a and libbpf_h may be
in different directories. The system libbpf version needs to match the minimum
libbpf version for scx.
$ cd $SCX
$ meson setup build --prefix ~ -D libbpf_a=/usr/lib64/libbpf.a -D libbpf_h=/usr/include/bpf/
$ meson compile -C build
$ meson install -C build
Dynamic linking against libbpf
$ cd $SCX
$ meson setup build --prefix ~ -D libbpf_a=disabled
$ meson compile -C build
$ meson install -C build
Using a different bpftool
This will check the system for an installed bpftool
$ meson setup build --prefix ~ -D bpftool=disabled
Using a custom built bpftool
$ meson setup build --prefix ~ -D bpftool=/path/to/bpftool
Note that the meson compile step is not strictly necessary as install
implies compile. The above also builds debug binaries with
optimizations turned off, which is useful for development, but the resulting
binaries are unoptimized and large. For actual use, build release binaries.
meson uses -D argument to specify build options. The configuration
options can be specified at setup time but can also be changed afterwards
and meson will do the right thing. To switch to release builds, run the
following in the build directory and then compile and install again.
$ meson configure -Dbuildtype=release
Running meson configure without any argument shows all current build
options. For more information on meson arguments and built-in options,
please refer to meson --help and its documentation.
Building Specific Schedulers and Binary Locations
If you just want to build a subset of schedulers, you can specify the
scheduler names as arguments to meson compile. For example, if we just
want to build the simple example scheduler
scheds/c/scx_simple and the Rust userspace scheduler
scheds/rust/scx_rusty:
$ cd $SCX
$ meson setup build -Dbuildtype=release
$ meson compile -C build scx_simple scx_rusty
Note for systems with sccache installed: meson automatically
uses sccache if available. However, sccache fails in one of the build
steps. If you encounter this issue, disable sccache by specifying CC
directly:
$ CC=clang meson setup build -Dbuildtype=release
You can also specify -v if you want to see the commands being used:
$ meson compile -C build -v scx_pair
For C userspace schedulers such as the ones under scheds/c,
the built binaries are located in the same directory under the build root.
For example, here, the scx_simple binary can be found at
$SCX/build/scheds/c/scx_simple.
For Rust userspace schedulers such as the ones under scheds/rust, the
scx_rusty binary can be found at $SCX/build/scheds/rust/release.
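The two locations described above can be captured in a small lookup helper. This is a sketch under stated assumptions: the function name scx_binary_path is hypothetical, and the Rust path assumes a release build whose binary is named after the scheduler inside the release directory.

```shell
#!/bin/sh
# scx_binary_path KIND NAME [ROOT]: print where a built scheduler is expected.
# KIND is "c" or "rust"; ROOT defaults to $SCX, or the current directory.
scx_binary_path() {
    root="${3:-${SCX:-.}}"
    case "$1" in
        c)    echo "$root/build/scheds/c/$2" ;;
        # Assumption: release build, binary named after the scheduler.
        rust) echo "$root/build/scheds/rust/release/$2" ;;
        *)    echo "unknown scheduler kind: $1" >&2
              return 1 ;;
    esac
}
```

For instance, scx_binary_path c scx_simple prints the path given above for scx_simple, relative to the repository root.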
SCX specific build options
While the default options should work in most cases, it may be desirable to
override some of the toolchains and dependencies - e.g. to directly use
libbpf built from the kernel source tree. The following meson build
options can be used in such cases.
- bpf_clang: clang to use when compiling .bpf.c
- bpftool: bpftool to use when generating .bpf.skel.h. Set this to "disabled" to check the system for an already installed bpftool
- libbpf_a: Static libbpf.a to use. Set this to "disabled" to link libbpf dynamically
- libbpf_h: libbpf header directories, only meaningful with libbpf_a option
- cargo: cargo to use when building Rust sub-projects
- cargo_home: CARGO_HOME env to use when invoking cargo
- offline: Compilation step should not access the internet
- enable_rust: Enable the build of Rust sub-projects
For example, let's say you want to use bpftool and libbpf shipped in the
kernel tree located at $KERNEL. We need to build bpftool in the kernel
tree first, set up SCX build with the related options and then build &
install.
$ cd $KERNEL
$ make -C tools/bpf/bpftool
$ cd $SCX
$ BPFTOOL=$KERNEL/tools/bpf/bpftool
$ meson setup build -Dbuildtype=release -Dprefix=~/bin \
-Dbpftool=$BPFTOOL/bpftool \
-Dlibbpf_a=$BPFTOOL/libbpf/libbpf.a \
-Dlibbpf_h=$BPFTOOL/libbpf/include
$ meson install -C build
Note that we use libbpf which was produced as a part of bpftool build
process rather than building libbpf directly. This is necessary because
libbpf header files need to be installed for them to be in the expected
relative locations.
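Since all three options in the example above derive from the same kernel tree location, they can be generated from a single argument. The following is a sketch; the function name kernel_tree_opts is hypothetical, and the paths are the ones used in the example.

```shell
#!/bin/sh
# kernel_tree_opts KERNEL: print the meson options that point at the bpftool
# and libbpf built inside the kernel tree rooted at KERNEL, one per line.
kernel_tree_opts() {
    bpftool_dir="$1/tools/bpf/bpftool"
    printf '%s\n' \
        "-Dbpftool=$bpftool_dir/bpftool" \
        "-Dlibbpf_a=$bpftool_dir/libbpf/libbpf.a" \
        "-Dlibbpf_h=$bpftool_dir/libbpf/include"
}
```

The output can be passed to meson setup via command substitution, for example meson setup build -Dbuildtype=release $(kernel_tree_opts "$KERNEL"). The unquoted substitution relies on word splitting, so it only works when the kernel path contains no whitespace.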
Offline Compilation
Rust builds automatically download dependencies from crates.io; however,
some build environments might not allow internet access requiring all
dependencies to be available offline. The fetch target and offline
option are provided for such cases.
The following downloads all Rust dependencies into $HOME/cargo-deps.
$ cd $SCX
$ meson setup build -Dcargo_home=$HOME/cargo-deps
$ meson compile -C build fetch
The following builds the schedulers without accessing the internet. The
build directory doesn't have to be the same one. The only requirement is
that the cargo_home option points to a directory which contains the
content generated from the previous step.
$ cd $SCX
$ meson setup build -Dcargo_home=$HOME/cargo-deps -Doffline=true -Dbuildtype=release
$ meson compile -C build
Working with Rust Sub-projects
Each Rust sub-project is its own self-contained cargo project. When building
as a part of this repository, meson invokes cargo with the appropriate
options and environment variables to sync the build environment. When
building separately by running cargo build directly in a sub-project
directory, the build will automatically figure out the build environment.
Please take a look at the scx_utils::BpfBuilder documentation for details.
For example, the following builds and runs the scx_rusty scheduler:
$ cd $SCX/scheds/rust/scx_rusty
$ cargo build --release
$ cargo run --release
Here too, the build step is not strictly necessary as it's implied by
run.
Note that Rust userspace schedulers are published on crates.io and can be
built and installed without cloning this repository as long as the necessary
toolchains are available. Simply run:
$ cargo install --locked scx_rusty
and scx_rusty will be built and installed as ~/.cargo/bin/scx_rusty.
Checking scx_stats
With the implementation of scx_stats, schedulers no longer display statistics by default. To display the statistics from the currently running scheduler, a manual user action is required.
Below are examples of how to do this.
- To check the scheduler statistics, run the scheduler with the --monitor option and an interval:
scx_SCHEDNAME --monitor $INTERVAL
For example, an interval of 0.5 prints the output every half a second:
scx_bpfland --monitor 0.5
Some schedulers may implement different or multiple monitoring options. Refer to --help of each scheduler for details.
Most schedulers also accept --stats $INTERVAL to print the statistics directly from the scheduling instance.
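When the monitoring command is generated by a script, it can be useful to validate the interval first. The helper below is a sketch only: the function name monitor_cmd is hypothetical, and it merely prints the command line in the --monitor form described above rather than running a scheduler.

```shell
#!/bin/sh
# monitor_cmd SCHEDULER INTERVAL: print the monitoring command line after
# checking that INTERVAL is a plain decimal number such as 5 or 0.5.
monitor_cmd() {
    case "$2" in
        # Reject empty values, non-numeric characters, multiple dots, a lone dot.
        ''|*[!0-9.]*|*.*.*|.)
            echo "invalid interval: $2" >&2
            return 1 ;;
    esac
    echo "$1 --monitor $2"
}
```

For example, monitor_cmd scx_bpfland 0.5 prints the scx_bpfland invocation shown above, while a non-numeric interval is rejected with an error message on stderr.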
Examples
- scx_bpfland
$ scx_bpfland --monitor 5
[scx_bpfland] tasks -> run: 3/4 int: 2 wait: 3 | nvcsw: 3 | dispatch -> dir: 0 prio: 73 shr: 9
[scx_bpfland] tasks -> run: 4/4 int: 2 wait: 2 | nvcsw: 3 | dispatch -> dir: 1 prio: 3498 shr: 1385
[scx_bpfland] tasks -> run: 4/4 int: 2 wait: 2 | nvcsw: 3 | dispatch -> dir: 1 prio: 2492 shr: 1311
[scx_bpfland] tasks -> run: 4/4 int: 2 wait: 3 | nvcsw: 3 | dispatch -> dir: 2 prio: 3270 shr: 1748
- scx_rusty
$ scx_rusty --monitor 5
###### Thu, 29 Aug 2024 14:42:37 +0200, load balance @ -265.1ms ######
cpu= 0.00 load= 0.17 mig=0 task_err=0 lb_data_err=0 time_used= 0.0ms
tot= 15 sync_prev_idle= 0.00 wsync= 0.00
prev_idle= 0.00 greedy_idle= 0.00 pin= 0.00
dir= 0.00 dir_greedy= 0.00 dir_greedy_far= 0.00
dsq=100.00 greedy_local= 0.00 greedy_xnuma= 0.00
kick_greedy= 0.00 rep= 0.00
dl_clamp=33.33 dl_preset=93.33
slice=20000us
direct_greedy_cpus=f
kick_greedy_cpus=f
NODE[00] load= 0.17 imbal= +0.00 delta= +0.00
DOM[00] load= 0.17 imbal= +0.00 delta= +0.00
- scx_lavd
$ scx_lavd --monitor 5
| 12 | 1292 | 3 | 1 | 8510 | 37.6028 | 2.42068 | 99.1304 | 100 | 62.8907 | 100 | 100 | 62.8907 | performance | 100 | 0 | 0 |
| 13 | 2208 | 3 | 1 | 6142 | 33.3442 | 2.39336 | 98.7626 | 100 | 60.2084 | 100 | 100 | 60.2084 | performance | 100 | 0 | 0 |
| 14 | 941 | 3 | 1 | 5223 | 31.323 | 1.704 | 99.215 | 100.019 | 59.1614 | 100 | 100.019 | 59.1614 | performance | 100 | 0 | 0 |
- scx_rustland
$ scx_rustland --monitor 5
[RustLand] tasks -> r: 1/4 w: 3 /3 | pf: 0 | dispatch -> u: 4 k: 0 c: 0 b: 0 f: 0 | cg: 0
[RustLand] tasks -> r: 1/4 w: 2 /2 | pf: 0 | dispatch -> u: 28385 k: 0 c: 0 b: 0 f: 0 | cg: 0
[RustLand] tasks -> r: 0/4 w: 4 /0 | pf: 0 | dispatch -> u: 25288 k: 0 c: 0 b: 0 f: 0 | cg: 0
[RustLand] tasks -> r: 0/4 w: 2 /0 | pf: 0 | dispatch -> u: 30580 k: 0 c: 0 b: 0 f: 0 | cg: 0
[RustLand] tasks -> r: 0/4 w: 2 /0 | pf: 0 | dispatch -> u: 30824 k: 0 c: 0 b: 0 f: 0 | cg: 0
[RustLand] tasks -> r: 1/4 w: 1 /1 | pf: 0 | dispatch -> u: 33178 k: 0 c: 0 b: 0 f: 0 | cg: 0
systemd services
See: services
Kernel Feature Status
Before the upstream merge in version 6.12 noted above, the kernel feature was
developed in the sched_ext repository. The following are important branches
in that repository:
- sched_ext: The main development branch. This branch periodically pulls from
the bpf-next tree to stay in sync with the kernel and BPF developments.
- sched_ext-release-*: sched_ext backports on top of released kernels. We
plan to maintain backports for a few recent kernel releases until
sched_ext is merged upstream. Currently maintained backports:
- sched_ext-vN: Patchsets posted upstream. The v4 LKML thread has
high-level discussions.
Breaking Changes
A list of the breaking changes in the sched_ext kernel tree and the associated commits for the schedulers in this repo.
Developer Guide
Want to learn how to develop a scheduler or find some useful tools for working
with schedulers? See the developer guide for more details.
Getting in Touch
We aim to build a friendly and approachable community around sched_ext. You
can reach us through the following channels:
- GitHub: https://github.com/sched-ext/scx
- Discord: https://discord.gg/b2J8DrWa7t
- Mailing List: sched-ext@lists.linux.dev (for kernel development)
We also hold weekly office hours every Tuesday. Please see the #office-hours
channel on Discord for details.
Additional Resources
There are blog posts and articles about sched_ext which can help you explore
sched_ext in various ways. The following are some examples:
- LWN: The extensible scheduler class (February, 2023)
- arighi's blog: Implement your own kernel CPU scheduler in Ubuntu with sched-ext (July, 2023)
- Changwoo's blog: sched_ext: a BPF-extensible scheduler class (Part 1) (December, 2023)
- arighi's blog: Getting started with sched-ext development (April, 2024)
- Changwoo's blog: sched_ext: scheduler architecture and interfaces (Part 2) (June, 2024)
- arighi's YT channel: scx_bpfland Linux scheduler demo: topology awareness (August, 2024)