Kernel Preemption

As described above, the scheduler is invoked before returning to user mode after system calls or at
certain designated points in the kernel. This ensures that the kernel, unlike user processes, cannot be
interrupted unless it explicitly wants to be. This behavior can be problematic if the kernel is in the middle
of a relatively long operation, as may well be the case with filesystem- or memory-management-related
tasks. The kernel then executes on behalf of a specific process for a long time, and other processes do
not get to run in the meantime. This may result in deteriorating system latency, which users experience
as "sluggish" response. Video and audio dropouts may also occur in multimedia applications if they are
denied CPU time for too long.
These problems can be resolved by compiling the kernel with support for kernel preemption. This allows
not only userspace applications but also the kernel to be interrupted if a high-priority process has urgent
work to do. Keep in mind that kernel preemption and preemption of userland tasks by other userland
tasks are two different concepts!
Kernel preemption was added during the development of kernel 2.5. Although astonishingly few
changes were required to make the kernel preemptible, the mechanism is not as easy to implement
as preemption of tasks running in userspace. If the kernel cannot complete certain actions in a single
operation — manipulation of data structures, for instance — race conditions may occur and render the
system inconsistent.

The kernel may not, therefore, be interrupted at all points. Fortunately, most of these points have already
been identified by the SMP implementation, and this information can be reused to implement kernel preemption.
Problematic sections of the kernel that may only be accessed by one processor at a time are
protected by so-called spinlocks: The first processor to arrive at a dangerous (also called critical) region
acquires the lock, and releases the lock once the region is left again. Another processor that wants to
access the region in the meantime has to wait until the first user has released the lock. Only then can it
acquire the lock and enter the dangerous region.
If the kernel can be preempted, even uniprocessor systems will behave like SMP systems. Consider that
the kernel is working inside a critical region when it is preempted. The next task also operates in kernel
mode, and unfortunately also wants to access the same critical region. This is effectively equivalent to
two processors working in the critical region at the same time and must be prevented. Every time the
kernel is inside a critical region, kernel preemption must be disabled.
How does the kernel keep track of whether it can be preempted or not? Recall that each task in the system
is equipped with an architecture-specific instance of struct thread_info. The structure contains a
preemption counter, preempt_count.

The value of this element determines whether the kernel is currently at a position where it may be interrupted.
If preempt_count is zero, the kernel can be interrupted, otherwise not. The value must not be
manipulated directly, but only with the auxiliary functions dec_preempt_count and inc_preempt_count,
which, respectively, decrement and increment the counter. inc_preempt_count is invoked each time
the kernel enters an important area where preemption is forbidden. When this area is exited,
dec_preempt_count decrements the value of the preemption counter by 1. Because the kernel can enter some
important areas via different routes — particularly via nested routes — a simple Boolean variable would
not be sufficient for preempt_count. When multiple dangerous regions are entered one after another, it
must be made sure that all of them have been left before the kernel can be preempted again.
The dec_preempt_count and inc_preempt_count calls are integrated in the synchronization operations
for SMP systems (see Chapter 5). They are, in any case, already present at all relevant points of
the kernel, so the preemption mechanism can make best use of them simply by reusing the existing
infrastructure.

Some more routines are provided for preemption handling:
❑ preempt_disable disables preemption by calling inc_preempt_count. Additionally, the compiler
is instructed to avoid certain memory optimizations that could lead to problems with the
preemption mechanism.
❑ preempt_check_resched checks if scheduling is necessary and does so if required.
❑ preempt_enable enables kernel preemption, and additionally checks afterward if rescheduling
is necessary with preempt_check_resched.
❑ preempt_enable_no_resched enables kernel preemption, but does not check whether rescheduling
is required.