schedtool
Copyright (C) 2002-2006 Freek
Released under GPL, version 2 (see LICENSE)
Use at your own risk.
Inspired by setbatch (C) 2002 Ingo Molnar
Suggestions are welcome.

CONTENT:
--------

-About
-Usage / description of schedtool
-A complex example
-Static Priority
-Policy overview + where to use
-Final words / contact
-Thanks
-Appendix A: Introduction to Multi-Level-Feedback-Queue-scheduling



ABOUT:
------

schedtool was born because there was no tool to change or query
all CPU-scheduling policies under Linux in one handy command.
Support for CPU-affinity has also been added and, most recently,
(re-)nicing of processes.
Thus, schedtool is the definitive interface to Linux's scheduler.

It can be used to avoid skipping in A/V-applications, to lock
processes onto certain CPUs on SMP/NUMA systems (which may be
beneficial for networking or benchmarks), or to adjust the nice-levels
of less important jobs to maintain a high level of interactive
responsiveness under high load.

All output, even errors, goes to STDOUT to ease piping.

If you don't know about scheduling policies, you probably don't want to
use this program - or you should first learn about them by reading
"man sched_setscheduler".

Certain modes (as of this writing: SCHED_IDLEPRIO and SCHED_ISO) need a
patched kernel. See INSTALL for details.



USAGE:
------

There are 3 operation modes: query, set and execute new process.

QUERY PROCESS(ES):

#> schedtool <LIST_OF_PIDs>

This will print all information it can obtain for the processes with
<PIDs>.


SET PROCESS(ES):

I) scheduling policy (detailed discussion in "POLICY OVERVIEW")

#> schedtool -<MODE> <LIST_OF_PIDs>

where <MODE> is one of:

	N or 0:		for SCHED_NORMAL
	F or 1:		for SCHED_FIFO
	R or 2:		for SCHED_RR
	B or 3:		for SCHED_BATCH
	I or 4:		for SCHED_ISO

example:
#> schedtool -B <PIDs>


II) static priority

.. is mandatory for SCHED_FIFO and SCHED_RR
STATIC_PRIO is a number from 1-99; higher values mean higher priority in
that scheduling class (relative to other processes in the same class).

example:
#> schedtool -R -p 20 <PIDs>

III) CPU-affinity

example:
#> schedtool -a 0x3 <PIDs>

IV) nice-level

example:
#> schedtool -n 10 <PIDs>


Of course you can combine policy with affinity and nice in one call.


EXECUTE A NEW PROCESS:

example:
#> schedtool [SCHED_PARAMETERS_LIKE_ABOVE] -e command -arg1 -arg2 file

This will execute "command -arg1 -arg2 file" exactly as if you had typed
it at the prompt.
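
The "set the parameters, then execute" order can be illustrated with a
small Python sketch. This is not schedtool's own code (schedtool is written
in C); run_with_nice is a hypothetical helper that handles only the
nice-level. The nice-level is changed in the child process just before the
command is executed, so the command starts with the new value and the
calling process keeps its own.

```python
import os
import subprocess
import sys


def run_with_nice(increment, argv):
    """Run argv with its nice-level raised by `increment` (hypothetical helper).

    The nice-level is changed in the child process right before exec, so the
    parent keeps its own nice-level.
    """
    return subprocess.run(
        argv,
        preexec_fn=lambda: os.nice(increment),  # runs in the child, before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child prints its own nice-level; os.nice(0) reads it without changing it.
    child = [sys.executable, "-c", "import os; print(os.nice(0))"]
    result = run_with_nice(5, child)
    print("parent nice:", os.nice(0), "child nice:", result.stdout.strip())
```

Policy and affinity would be applied at the same point, before the exec.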



CPU-affinity:

To give PIDs/a command a certain CPU-affinity, use the -a switch.
The value is used as a simple bitmask: a bit set to 1 means the PID may
run on that CPU, a bit unset (0) means it MUST NOT.

The following picture uses only 16 bits for illustration.
The resulting value is the bitwise OR of the single values for each CPU.
CPU0 (the first CPU in your system) is denoted by the least significant bit
(here, the one on the right side).

  CPU 0-----------,
 CPU 1-----------,|
  ...            ||
mask             VV		means				value  == dec
--------------------------------------------------------------------------
0000 0000 0000 0001	->	run only on CPU0	->	0x1    ==  1
0000 0000 0000 1001	->	run on CPU0 AND CPU3	->	0x9    ==  9
0000 0000 0000 1111	->	run on CPU0-CPU3	->	0xF    == 15
1111 1111 1111 1111	->	run on CPU0-CPU15	->	0xFFFF ==2^16-1

To set back to the default (PID may run on all CPUs), use the mask
0xFFFFFFFF (the kernel will automatically reduce it to the max # of cpus)

As a short mnemonic rule, each 'F' denotes a set of 4 CPUs
(0xF: all 4 CPUs, 0xFF: all 8 CPUs, and so on ...)
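
To double-check a mask before using it, the bit rule can be written down
in a few lines of Python. This is only an illustrative sketch; the function
names are made up here and are not part of schedtool.

```python
def cpus_from_mask(mask):
    """Return the sorted list of CPU numbers whose bit is set in `mask`."""
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:          # least significant bit is CPU0
            cpus.append(cpu)
        mask >>= 1
        cpu += 1
    return cpus


def mask_for_first_cpus(count):
    """Return the mask with the lowest `count` bits set (CPU0..CPU<count-1>)."""
    return (1 << count) - 1


if __name__ == "__main__":
    for mask in (0x1, 0x9, 0xF, 0xFFFF):
        print(hex(mask), "->", cpus_from_mask(mask))
    print(hex(mask_for_first_cpus(8)))  # the "all 8 CPUs" mask
```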


Since version 1.1.1 a new list mode is supported, allowing you to
specify the target-CPUs without doing bitjuggling. To separate the
different CPUs, use a ',':

Run on CPU0 and CPU1:
#> schedtool -a 0,1 <PIDs>
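
The list mode describes the same bitmask in a different notation. The
following Python sketch shows the conversion; it is an assumption about
the equivalence only, not schedtool's actual parser, and
mask_from_cpu_list is a hypothetical name.

```python
def mask_from_cpu_list(spec):
    """Convert a comma-separated CPU list such as "0,1" into a bitmask."""
    mask = 0
    for field in spec.split(","):
        cpu = int(field)      # raises ValueError on non-numeric input
        if cpu < 0:
            raise ValueError("CPU numbers must not be negative")
        mask |= 1 << cpu      # bitwise OR of the single-CPU values
    return mask


if __name__ == "__main__":
    print(hex(mask_from_cpu_list("0,1")))  # same CPUs as -a 0x3
```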



A COMPLEX EXAMPLE:
------------------
#> schedtool -R -p 50 -a 0x2 -e mplayer file.avi

Execute mplayer file.avi with
	-SCHED_RR,
	-static priority 50,
	-affinity 0x2 (run only on CPU1).



ABOUT STATIC PRIORITY:
----------------------
Static priority is something completely different from the nice-level: the
nice-level is added to the dynamic priority, and the higher it gets, the more
the process is "punished"([2]). The static priority, in contrast, is used to
find the next process to run in the current scheduling class; the higher it
is, the more the process is preferred >in general< over others, e.g. when
it becomes ready after a blocking action. It will/may also preempt
another, lower-prioritized process.

STATIC_PRIO can't be assigned to SCHED_NORMAL or SCHED_BATCH. The
code won't prevent this (a warning is printed - think UNIX), but you may get
an error later at the setting-call.
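
If you generate schedtool calls from a script, a check like the following
Python sketch can catch an invalid combination early. The ranges are the
ones printed by the -r probe in this section; the table and the function
name are hypothetical and not part of schedtool.

```python
# Static priority ranges as printed by "schedtool -r" in this README.
PRIO_RANGE = {
    "SCHED_NORMAL": (0, 0),
    "SCHED_FIFO": (1, 99),
    "SCHED_RR": (1, 99),
    "SCHED_BATCH": (0, 0),
}


def check_static_prio(policy, prio):
    """Return True if `prio` is a valid static priority for `policy`."""
    prio_min, prio_max = PRIO_RANGE[policy]
    return prio_min <= prio <= prio_max


if __name__ == "__main__":
    print(check_static_prio("SCHED_RR", 20))     # valid combination
    print(check_static_prio("SCHED_BATCH", 20))  # would fail at the setting-call
```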

v1.2.4+ supports a probe mode like sched-utils; when given the -r parameter,
it displays each policy's min and max priority.

#> schedtool -r
N: SCHED_NORMAL  : prio_min 0, prio_max 0
F: SCHED_FIFO    : prio_min 1, prio_max 99
R: SCHED_RR      : prio_min 1, prio_max 99
B: SCHED_BATCH   : prio_min 0, prio_max 0
I: SCHED_ISO     : policy not implemented
D: SCHED_IDLEPRIO: policy not implemented
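
The same ranges can be queried from a script. The sketch below uses
Python's os module and assumes a Linux system; it is not how schedtool
itself is implemented. Python calls SCHED_NORMAL by its POSIX name
SCHED_OTHER, and it has no constants for SCHED_ISO or SCHED_IDLEPRIO, so
those are left out.

```python
import os

# Policy names as Python's os module spells them; SCHED_OTHER is the
# POSIX name for what this README calls SCHED_NORMAL.
POLICY_NAMES = ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH")


def probe_priority_ranges():
    """Return {policy name: (prio_min, prio_max)} for the policies found."""
    ranges = {}
    if not hasattr(os, "sched_get_priority_min"):
        return ranges             # platform without these calls
    for name in POLICY_NAMES:
        policy = getattr(os, name, None)
        if policy is None:        # constant not available on this platform
            continue
        try:
            ranges[name] = (os.sched_get_priority_min(policy),
                            os.sched_get_priority_max(policy))
        except OSError:           # policy not implemented by the kernel
            continue
    return ranges


if __name__ == "__main__":
    for name, (prio_min, prio_max) in probe_priority_ranges().items():
        print(f"{name}: prio_min {prio_min}, prio_max {prio_max}")
```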



POLICY OVERVIEW + WHERE TO USE:
-------------------------------
SCHED_NORMAL
is the standard scheduling policy and good for the average
job with reasonable interaction.


SCHED_FIFO and SCHED_RR
are for real-time constraints.
Don't use them for normal stuff, because they've got extremely short
time-slices increasing the context-switching overhead and they won't
let other processes run until they get blocked by a system-call like
read() or actively free themselves from the CPU via the system-call
sched_yield(2).


SCHED_BATCH
is encouraged for long-running and non-interactive
processes; the timeslice is considerably longer (1.5s I think) -
these processes, though, can be interrupted at almost any time by other
ones to guarantee interactiveness.
Processes won't get any interactive boosts.

Users are encouraged to set their computing jobs to SCHED_BATCH. Or, as
admin of a compute-server, you could set their shells to SCHED_BATCH
via the login-script.
SCHED_BATCH has been included in 2.6.16+ kernels.


SCHED_ISO [patch needed, see INSTALL]
is a new mode, currently only in Con's patches, to mimic the
real-time class for non-root users. To quote Con:

"Any task trying to start as real time that doesn't have authority to do so
will be set to SCHED_ISO. This is a non-expiring scheduler policy designed to
guarantee a timeslice within a reasonable latency while preventing starvation.
Good for gaming, video at the limits of hardware, video capture etc.
It is best set using the schedtool by a normal user trying to start something
as SCHED_RR." [ http://kerneltrap.org/node/view/2159 ]

SCHED_ISO is now somewhat deprecated; SCHED_RR is now possible for normal users,
albeit only to a limited extent. See newer kernels.


SCHED_IDLEPRIO [patch needed, see INSTALL]
SCHED_IDLEPRIO was formerly called SCHED_BATCH in the -ck patchset; the
-ck SCHED_BATCH has nothing to do with the mainline SCHED_BATCH!
It is a policy where the process does not get any interactive boost
(through sleeping etc.) and only gets the otherwise idle CPU time.

For more information you can read the file SCHED_DESIGN as a good overview, but
be warned that *some* things may be outdated by the new O(1)-patches.
Then proceed to the man-page for sched_setscheduler(2) - it gives a very good
overview and is _highly_ recommended.



FINAL WORDS / CONTACT:
----------------------
If you feel you are able to make this software better or you can report
some numbers with the different scheduling policies, please contact me.
Feedback is appreciated.
Please use freshmeat.net's "contact author"-feature to do so.



THANKS:
-------
Thanks fly out to (in no particular order)

o Ingo Molnar
o Con Kolivas for suggesting the -e switch, submitting patch for SCHED_ISO
o Samuli Kärkkäinen, the quality-verification-engineer
o my girlfriend and supporting friends



- -- - -- -

[2]:
A bit simplified - it's not all that easy :-) Go on to Appendix A for
an example on how scheduling is performed in Solaris.

[3]: (see also [4])
Nice level and dynamic priority are somewhat "strange": sometimes, higher
values mean higher priority (to be put on CPU when process ready); sometimes,
lower values mean higher priority.

At the moment, I can confirm the following for my system:
- The nice-level is SUBTRACTED from the dynamic priority; thus giving a
process nice -10 means INCREASING its priority (valuewise) by 10 points.
- In the end, higher values mean higher priority.
Use "ps -eO pri,nice" and look for yourself.



APPENDIX A: INTRODUCTION TO MULTI-LEVEL-FEEDBACK-QUEUE-SCHEDULING
-----------------------------------------------------------------
This appendix uses information originating from the "System
Programming I"-course at my university; the examples use Solaris
2.X, but I think Linux does it in a >similar< (albeit not as
overcomplicated) way.

Solaris has 60 wait queues for the class TS (TimeSharing); there are
other classes, such as System and RealTime, as well.
A TS-queue looks like this (which is basically a set of rules):

Level    ts_quantum    ts_tqexp     ts_maxwait    ts_lwait   ts_slpret
0        200           0            0             50         50
.        .             .            .             .          .
.        .             .            .             .          .
.        .             .            .             .          .
44       40            34           0             55         55
45       40            35           0             56         56
.        .             .            .             .          .
.        .             .            .             .          .
.        .             .            .             .          .
59       20            49           32000         59         59


You can display all these numbers on a Solaris-box using
# dispadmin -c TS -g

Level:
	just a queue ID.

ts_quantum:
	the maximum timeslice - the maximum time the process is allowed
	to run continuously before it's interrupted and another process
	is physically put on the CPU.

ts_tqexp:
	if the process uses its timeslice entirely, it's put into that
	queue [cf. Level].

ts_maxwait:
	maximum time, in seconds, a process may wait in that queue
	without being run.

ts_lwait:
	if a process stays too long in the current queue (longer than
	ts_maxwait), it's put into that queue.

ts_slpret:
	queue to put the process in after it was blocked, e.g. in a
	syscall.
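
Read as a set of rules, each row answers the question "which queue comes
next?" for three different events. A minimal Python sketch, using only the
rows shown in the table above; the event names and next_level are made up
for this illustration and are not Solaris terminology.

```python
# Rows copied from the table above: level -> field values.
TS_TABLE = {
    0:  {"ts_quantum": 200, "ts_tqexp": 0,  "ts_maxwait": 0,
         "ts_lwait": 50, "ts_slpret": 50},
    44: {"ts_quantum": 40,  "ts_tqexp": 34, "ts_maxwait": 0,
         "ts_lwait": 55, "ts_slpret": 55},
    45: {"ts_quantum": 40,  "ts_tqexp": 35, "ts_maxwait": 0,
         "ts_lwait": 56, "ts_slpret": 56},
    59: {"ts_quantum": 20,  "ts_tqexp": 49, "ts_maxwait": 32000,
         "ts_lwait": 59, "ts_slpret": 59},
}

# Which field names the next queue for each event (hypothetical event names).
FIELD_FOR_EVENT = {
    "quantum_used_up": "ts_tqexp",
    "waited_too_long": "ts_lwait",
    "returned_from_sleep": "ts_slpret",
}


def next_level(level, event):
    """Return the queue a process in `level` is moved to after `event`."""
    return TS_TABLE[level][FIELD_FOR_EVENT[event]]


if __name__ == "__main__":
    print(next_level(59, "quantum_used_up"))     # lower queue, longer quantum
    print(next_level(0, "returned_from_sleep"))  # higher queue after blocking
```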


Let's start an imaginary process and see what happens:
Start ->
-> queue 59, ts_quantum  20ms -> queue 49, ts_quantum  40ms
-> queue 39, ts_quantum  80ms -> queue 29, ts_quantum 120ms
-> queue 19, ts_quantum 160ms -> queue  9, ts_quantum 200ms

You see how this purely number-crunching process is put into queues that
allow it to use the CPU for more and more time.
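
The walk above can be reproduced with a short loop. The Python sketch
below lists only the queues named in the walk; the ts_tqexp value of
queue 9 is not given there, so the loop simply stops at that point.
TS_ROWS and cpu_bound_walk are hypothetical names.

```python
# level: (ts_quantum in ms, ts_tqexp). Only the levels named in the walk
# above are listed; the ts_tqexp of level 9 is not given there, hence None.
TS_ROWS = {
    59: (20, 49),
    49: (40, 39),
    39: (80, 29),
    29: (120, 19),
    19: (160, 9),
    9: (200, None),
}


def cpu_bound_walk(start_level):
    """Follow ts_tqexp from start_level; return [(level, quantum_ms), ...]."""
    walk = []
    level = start_level
    while level is not None:
        quantum, next_level = TS_ROWS[level]
        walk.append((level, quantum))
        level = next_level
    return walk


if __name__ == "__main__":
    for level, quantum in cpu_bound_walk(59):
        print(f"queue {level:2d}, ts_quantum {quantum:3d}ms")
```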


Now let the process perform a blocking action:

queue  0, ts_quantum 200ms, after e.g. 100ms blocking call! -->
queue 50, ts_quantum  40ms, after e.g.  20ms blocking call! -->
queue 58, ts_quantum  40ms

Now you see how this process is "punished", or from another point of
view, the scheduler thinks this is an interactive process that computes a
bit and then outputs data, so it's put into a queue whose timeslice roughly
matches the average computing time until such an output occurs.
So the scheduler knows pretty much about the current state of the
machine and can plan accordingly.

The dynamic priority is something like an age - the higher[4] it is the more
likely you get a seat :) (the CPU).
There are 4 rules:
-if a process is not run, it ages - the dynamic priority rises.
-if a process is running, the dynamic priority is lowered.
-the process with the (at the moment) highest priority is put onto the CPU.
-processes with lower priority are/can be interrupted by processes with
higher priority.

This guarantees that no process runs for too long while others have to
wait for too long.
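
The four rules can be tried out in a toy model. The Python sketch below
is a strong simplification and not how Solaris or Linux actually computes
priorities; the step sizes (age, penalty) are arbitrary assumptions, and
preemption is modelled by simply re-deciding at every tick.

```python
def simulate(priorities, ticks, age=1, penalty=3):
    """Toy aging scheduler; returns the name of the process run at each tick.

    priorities: dict mapping process name -> starting dynamic priority.
    """
    prio = dict(priorities)
    history = []
    for _ in range(ticks):
        # Rule 3: the process with the highest priority gets the CPU
        # (ties go to the alphabetically first name).
        running = max(sorted(prio), key=lambda name: prio[name])
        history.append(running)
        for name in prio:
            if name == running:
                prio[name] -= penalty   # rule 2: running lowers the priority
            else:
                prio[name] += age       # rule 1: waiting processes age
    return history


if __name__ == "__main__":
    # "b" starts far behind, but still gets the CPU after a few ticks.
    print(simulate({"a": 10, "b": 0}, ticks=8))
```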

	-End Of Documentation

- -- -

[4]:
higher (priority) in the sense of more important to run in the near future;
higher does not automatically mean a higher value in its PCB[5]

[5]:
Process Control Block, some structure where important accounting and
other useful information is stored, usually only used by the kernel.