Linux: Optimizing Linux Performance

2.2. Linux Performance Tools: CPU

Here begins our discussion of the performance tools that enable you to extract the information previously described.

2.2.1. vmstat (Virtual Memory Statistics)

vmstat stands for virtual memory statistics, which indicates that it will give you information about the virtual memory system performance of your system. Fortunately, it actually does much more than that. vmstat is a great command to get a rough idea of how your system performs as a whole. It tells you

How many processes are running
How the CPU is being used
How many interrupts the CPU receives
How many context switches the scheduler performs

It is an excellent tool to use to get a rough idea of how the system performs.

2.2.1.1 CPU Performance-Related Options

vmstat can be invoked with the following command line:

vmstat [-n] [-s] [delay [count]]

vmstat can be run in two modes: sample mode and average mode. If no parameters are specified, vmstat runs in average mode, where vmstat displays the average value for all the statistics since system boot. However, if a delay is specified, the first sample will be the average since system boot, but after that vmstat samples the system every delay seconds and prints out the statistics. Table 2-1 describes the options that vmstat accepts.

Table 2-1. `vmstat` Command-Line Options

Option	Explanation
`-n`	By default, `vmstat` periodically prints out the column headers for each performance statistic. This option disables that feature so that after the initial header, only performance data displays. This proves helpful if you want to import the output of `vmstat` into a spreadsheet.
`-s`	This displays a one-shot detailed output of the system statistics that `vmstat` gathers. The statistics are the totals since the system booted.
`delay`	This is the amount of time, in seconds, between `vmstat` samples.
`count`	This is the number of samples to take. If it is omitted and a delay is given, `vmstat` samples until it is interrupted.

vmstat provides a variety of different output statistics that enable you to track different aspects of the system performance. Table 2-2 describes those related to CPU performance. The next chapter covers those related to memory performance.

Table 2-2. CPU-Specific `vmstat` Output

Column	Explanation
`r`	This is the number of currently runnable processes. These processes are not waiting on I/O and are ready to run. Ideally, the number of runnable processes would match the number of CPUs available.
`b`	This is the number of processes blocked and waiting for I/O to complete.
`forks`	This is the number of times a new process has been created.
`in`	This is the number of interrupts occurring on the system.
`cs`	This is the number of context switches happening on the system.
`us`	This is the total CPU time as a percentage spent on user processes (including "nice" time).
`sy`	This is the total CPU time as a percentage spent in system code. This includes time spent in the `system`, `irq`, and `softirq` states.
`wa`	This is the total CPU time as a percentage spent waiting for I/O.
`id`	This is the total CPU time as a percentage that the system is idle.

vmstat provides a good low-overhead view of system performance. Because all the performance statistics are in text form and are printed to standard output, it is easy to capture the data generated during a test and process or graph it later. Because vmstat is such a low-overhead tool, it is practical to keep it running on a console or in a window even on a very heavily loaded server when you need to monitor the health of the system at a glance.

2.2.1.2 Example Usage

As shown in Listing 2.2, if vmstat is run with no command-line parameters, it displays the average values for the statistics that it records since the system booted. This example shows that the system was nearly idle since boot, as indicated by the CPU usage columns us, sy, wa, and id. The CPU spent 5 percent of the time since boot on user application code, 1 percent on system code, and the rest, 94 percent, sitting idle.

Listing 2.2.

[ezolt@scrffy tmp]$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 181024  26284  35292 503048    0    0     3     2    6     1  5  1 94  0

Although vmstat's statistics since system boot can be useful to determine how heavily loaded the machine has been, vmstat is most useful when it runs in sampling mode, as shown in Listing 2.3. In sampling mode, vmstat prints the system's statistics after the number of seconds passed with the delay parameter. It repeats this sampling count times. The first line of statistics in Listing 2.3 contains the system averages since boot, as before, but then the periodic samples follow. This example shows that there is very little activity on the system. We can see that no processes were blocked during the run by looking at the 0 in the b column. We can also see, by looking in the r column, that at most one process was runnable when vmstat sampled its data.

Listing 2.3.

[ezolt@scrffy tmp]$ vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 181024  26276  35316 502960    0    0     3     2    6     1  5  1 94  0
 1  0 181024  26084  35316 502960    0    0     0     0 1318   772  1  0 98  0
 0  0 181024  26148  35316 502960    0    0     0    24 1314   734  1  0 98  0
 0  0 181024  26020  35316 502960    0    0     0     0 1315   764  2  0 98  0
 0  0 181024  25956  35316 502960    0    0     0     0 1310   764  2  0 98  0

vmstat is an excellent way to record how a system behaves under a load or during a test condition. You can use vmstat to display how the system is behaving and, at the same time, save the result to a file by using the Linux tee command. (Chapter 8, "Utility Tools: Performance Tool Helpers," describes the tee command in more detail.) If you only pass in the delay parameter, vmstat will sample indefinitely. Just start it before the test, and interrupt it after the test has completed. The output file can be imported into a spreadsheet and used to see how the system reacts to the load or various system events. Listing 2.4 shows the output of this technique. In this example, we can look at the interrupts and context switches that the system is generating. We can see the number of interrupts and context switches in the in and cs columns, respectively.

The number of context switches looks good compared to the number of interrupts. The scheduler is switching processes less often than the timer interrupt is firing. This is most likely because the system is nearly idle, and most of the time when the timer interrupt fires, the scheduler does not have any work to do, so it does not switch from the idle process.

(Note: There is a bug in the version of vmstat that generated the following output. It causes the system average line of output to display incorrect values. This bug has been reported to the maintainer of vmstat and will be fixed soon, hopefully.)

Listing 2.4.

[ezolt@scrffy ~/edid]$ vmstat 1 | tee /tmp/output
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1 201060  35832  26532 324112    0    0     3     2    6     2  5  1 94  0
 0  0 201060  35888  26532 324112    0    0    16     0 1138   358  0  0 99  0
 0  0 201060  35888  26540 324104    0    0     0    88 1163   371  0  0 100 0
 0  0 201060  35888  26540 324104    0    0     0     0 1133   345  0  0 100 0
 0  0 201060  35888  26540 324104    0    0     0    60 1174   351  0  0 100 0
 0  0 201060  35920  26540 324104    0    0     0     0 1150   408  0  0 100 0
[Ctrl-C]
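A file captured this way still contains the banner and header lines, so it usually needs a little cleanup before a spreadsheet can read it. The following is a minimal Python sketch of that step, assuming the 16-column layout shown in Listing 2.4; the function name vmstat_to_csv and the SAMPLE text are illustrative and not part of vmstat.

```python
import csv
import io


def vmstat_to_csv(text):
    """Convert captured vmstat output into CSV text.

    Assumes the layout of Listing 2.4: a 'procs ...' banner line, a header
    line starting with 'r', then one line of integers per sample.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":
            continue  # blank line or the banner above the column names
        if fields[0] == "r":
            if header is None:  # keep only the first copy of the header
                header = fields
                writer.writerow(header)
            continue
        # Keep only rows that look like samples; this drops shell prompts
        # and the trailing [Ctrl-C] marker.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            writer.writerow(fields)
    return out.getvalue()


SAMPLE = """procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1 201060  35832  26532 324112    0    0     3     2    6     2  5  1 94  0
 0  0 201060  35888  26532 324112    0    0    16     0 1138   358  0  0 99  0
[Ctrl-C]
"""

if __name__ == "__main__":
    print(vmstat_to_csv(SAMPLE), end="")
```

Running vmstat with the -n option avoids the repeated headers in the first place; the sketch simply tolerates them so that either kind of capture can be used.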

More recent versions of vmstat can even extract more detailed information about a grab bag of different system statistics, as shown in Listing 2.5.

The next chapter discusses the memory statistics, but we look at the CPU statistics now. The first group of statistics, or "CPU ticks," shows how the CPU has spent its time since system boot, where a "tick" is a unit of time. Although the condensed vmstat output only showed four CPU states (us, sy, id, and wa), this shows how all the CPU ticks are distributed. In addition, we can see the total number of interrupts and context switches. One new addition is that of forks, which is basically the number of new processes that have been created since system boot.

Listing 2.5.

[ezolt@scrffy ~/edid]$ vmstat -s
      1034320  total memory
       998712  used memory
       698076  active memory
       176260  inactive memory
        35608  free memory
        26592  buffer memory
       324312  swap cache
      2040244  total swap
       201060  used swap
      1839184  free swap
      5279633 non-nice user cpu ticks
     28207739 nice user cpu ticks
      2355391 system cpu ticks
    628297350 idle cpu ticks
       862755 IO-wait cpu ticks
           34 IRQ cpu ticks
      1707439 softirq cpu ticks
     21194571 pages paged in
     12677400 pages paged out
        93406 pages swapped in
       181587 pages swapped out
   1931462143 interrupts
    785963213 CPU context switches
   1096643656 boot time
       578451 forks
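To see how these tick counts relate to the condensed us, sy, id, and wa columns, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it uses the grouping described in Table 2-2 (us includes nice time; sy includes system, irq, and softirq time), and the dictionary keys and function name are hypothetical labels rather than anything vmstat itself defines.

```python
# CPU tick counts copied from Listing 2.5 (keys are illustrative labels).
LISTING_2_5_TICKS = {
    "user": 5279633,
    "nice": 28207739,
    "system": 2355391,
    "idle": 628297350,
    "iowait": 862755,
    "irq": 34,
    "softirq": 1707439,
}


def condensed_cpu_percentages(ticks):
    """Fold the seven tick counters into the four condensed vmstat columns."""
    total = sum(ticks.values())
    groups = {
        "us": ("user", "nice"),              # user time, including nice time
        "sy": ("system", "irq", "softirq"),  # all kernel-side time
        "id": ("idle",),
        "wa": ("iowait",),
    }
    return {
        column: 100.0 * sum(ticks[state] for state in states) / total
        for column, states in groups.items()
    }


if __name__ == "__main__":
    for column, pct in condensed_cpu_percentages(LISTING_2_5_TICKS).items():
        print(f"{column}: {pct:.1f}%")
```

Rounded to whole numbers, the results for this data are 5, 1, 94, and 0, which agrees with the since-boot line that the condensed vmstat output reports on this system.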

vmstat provides a broad range of information about the performance of a Linux system. It is one of the core tools to use when investigating a problem with a system.

2.2.2. top (v. 2.0.x)

top is the Swiss army knife of Linux system-monitoring tools. It does a good job of putting a very large amount of system-wide performance information in a single screen. What you display can also be changed interactively; so if a particular problem crops up as your system runs, you can modify what top is showing you.

By default, top presents a list, in decreasing order, of the top CPU-consuming processes. This enables you to quickly pinpoint which program is hogging the CPU. Periodically, top updates the list based on a delay that you can specify. (It is initially 3 seconds.)

2.2.2.1 CPU Performance-Related Options

top is invoked with the following command line:

top [d delay] [C] [H] [i] [n iter] [b]

top actually takes options in two modes: command-line options and runtime options. The command-line options determine how top displays its information. Table 2-3 shows the command-line options that influence the type and frequency of the performance statistics that top displays.

Table 2-3. `top` Command-Line Options

Option	Explanation
`d delay`	Delay between statistic updates.
`n iterations`	Number of iterations before exiting. `top` updates the statistics `iterations` times.
`i`	Don't display processes that aren't using any of the CPU.
`H`	Show all the individual threads of an application rather than just display a total for each application.
`C`	In a hyperthreaded or SMP system, display the summed CPU statistics rather than the statistics for each CPU.

As you run top, you might want to fine-tune what you are observing to investigate a particular problem. The output of top is highly customizable. Table 2-4 describes options that change statistics shown during top's runtime.

Table 2-4. `top` Runtime Options

Option	Explanation
`f` or `F`	This displays a configuration screen that enables you to select which process statistics display on the screen.
`o` or `O`	This displays a configuration screen that enables you to change the order of the displayed statistics.

The options described in Table 2-5 turn on or off the display of various system-wide information. It can be helpful to turn off unneeded statistics to fit more processes on the screen.

Table 2-5. `top` Runtime Output Toggles

Option	Explanation
`l`	This toggles whether the load average and uptime information will be updated and displayed.
`t`	This toggles the display of how each CPU spends its time. It also toggles information about how many processes are currently running.
`m`	This toggles whether information about the system memory usage will be shown on the screen.

By default, the highest CPU consumers are displayed first. However, it might be more useful to sort by other characteristics. Table 2-6 describes the different sorting modes that top supports. Sorting by memory consumption is particularly useful to figure out which process consumes the most memory.

Table 2-6. `top` Output Sorting/Display Options

Option	Explanation
`P`	Sorts the tasks by their CPU usage. The highest CPU user displays first.
`T`	Sorts the tasks by the amount of CPU time they have used so far. The highest amount displays first.
`N`	Sorts the tasks by their PID. The lowest PID displays first.
`A`	Sorts the tasks by their age. The newest PID is shown first. This is usually the opposite of "sort by PID."
`i`	Hides tasks that are idle and are not consuming CPU.

top provides system-wide information in addition to information about specific processes. Table 2-7 covers these statistics.

Table 2-7. `top` Performance Statistics

Statistic	Explanation
`us`	CPU time spent in user applications.
`sy`	CPU time spent in the kernel.
`ni`	CPU time spent in "nice"ed processes.
`id`	CPU time spent idle.
`wa`	CPU time spent waiting for I/O.
`hi`	CPU time spent in the `irq` handlers.
`si`	CPU time spent in the `softirq` handlers.
`load average`	The 1-minute, 5-minute, and 15-minute load average.
`%CPU`	The percentage of CPU that a particular process is consuming.
`PRI`	The priority value of the process, where a lower value indicates a higher priority. `RT` indicates that the task has real-time priority, a priority higher than the standard range.
`NI`	The nice value of the process. The higher the nice value, the less often the system has to execute the process. Processes with high nice values tend to have very low priorities.
`WCHAN`	If a process is waiting on an I/O, this shows which kernel function it is waiting in.
`STAT`	This is the current status of a process, where the process is either sleeping (`S`), running (`R`), zombied (killed but not yet dead) (`Z`), in an uninterruptible sleep (`D`), or being traced (`T`).
`TIME`	The total amount of CPU time (user and system) that this process has used since it started executing.
`COMMAND`	The command that this process is executing.
`LC`	The number of the last CPU that this process was executing on.
`FLAGS`	The task flags of the process (see linux/sched.h).

top provides a large amount of information about the different running processes and is a great way to figure out which process is a resource hog.

2.2.2.2 Example Usage

Listing 2.6 is an example run of top. Once it starts, it periodically updates the screen until you exit it. This demonstrates some of the system-wide statistics that top can generate. First, we see the load average of the system over the past 1, 5, and 15 minutes. As we can see, the system has started to get busy recently (because of doom.x86). One CPU is busy with user code 90 percent of the time. The other is spending only 13 percent of its time in user code. Finally, we can see that 73 of the processes are sleeping, and only 3 of them are currently running.

Listing 2.6.

catan> top
 08:09:16  up 2 days, 18:44,  4 users,  load average: 0.95, 0.44, 0.17
76 processes: 73 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   51.5%    0.0%    3.9%   0.0%     0.0%    0.0%   44.6%
           cpu00   90.0%    0.0%    1.2%   0.0%     0.0%    0.0%    8.8%
           cpu01   13.0%    0.0%    6.6%   0.0%     0.0%    0.0%   80.4%
Mem:  2037140k av, 1132120k used,  905020k free,       0k shrd,   86220k buff
       689784k active,             151528k inactive
Swap: 2040244k av,       0k used, 2040244k free                  322648k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 7642 root      25   0  647M 379M  7664 R    49.9 19.0   2:58   0 doom.x86
 7661 ezolt     15   0  1372 1372  1052 R     0.1  0.0   0:00   1 top
    1 root      15   0   528  528   452 S     0.0  0.0   0:05   1 init
    2 root      RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration/0
    3 root      RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration/1
    4 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd
    5 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0
    6 root      34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd/1
    9 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 bdflush
    7 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kswapd
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kscand
   10 root      15   0     0    0     0 SW    0.0  0.0   0:00   1 kupdated
   11 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
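In this listing, the total row appears to be the plain average of the per-CPU rows. The short Python check below illustrates that arithmetic with the values from Listing 2.6; it is an observation about this particular output rather than a documented guarantee of top, and the function name is hypothetical.

```python
def average_cpu_rows(rows):
    """Average each column across the per-CPU rows, rounded to one decimal."""
    count = len(rows)
    return [round(sum(column) / count, 1) for column in zip(*rows)]


# Columns: user, nice, system, irq, softirq, iowait, idle (from Listing 2.6).
CPU00 = [90.0, 0.0, 1.2, 0.0, 0.0, 0.0, 8.8]
CPU01 = [13.0, 0.0, 6.6, 0.0, 0.0, 0.0, 80.4]
TOTAL = [51.5, 0.0, 3.9, 0.0, 0.0, 0.0, 44.6]

if __name__ == "__main__":
    print(average_cpu_rows([CPU00, CPU01]))
```

This is also why a single process that saturates one CPU of a two-CPU machine shows up as roughly 50 percent in the total row.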

Now pressing F while top is running brings up the configuration screen shown in Listing 2.7. When you press the keys indicated (A for PID, B for PPID, etc.), top toggles whether these statistics display in the previous screen. When all the desired statistics are selected, press Enter to return to top's initial screen, which now shows the current values of selected statistics. When configuring the statistics, all currently activated fields are capitalized in the Current Field Order line and have an asterisk (*) next to their name.

Listing 2.7.

[ezolt@wintermute doc]$ top
(press 'F' while running)

  Current Field Order: AbcDgHIjklMnoTP|qrsuzyV{EFW[X
Toggle fields with a-z, any other key to return:
* A: PID       = Process Id
  B: PPID      = Parent Process Id
  C: UID       = User Id
* D: USER      = User Name
* E: %CPU      = CPU Usage
* F: %MEM      = Memory Usage
  G: TTY       = Controlling tty
* H: PRI       = Priority
* I: NI        = Nice Value
  J: PAGEIN    = Page Fault Count
  K: TSIZE     = Code Size (kb)
  L: DSIZE     = Data+Stack Size (kb)
* M: SIZE      = Virtual Image Size (kb)
  N: TRS       = Resident Text Size (kb)
  O: SWAP      = Swapped kb
* P: SHARE     = Shared Pages (kb)
  Q: A         = Accessed Page count
  R: WP        = Write Protected Pages
  S: D         = Dirty Pages
* T: RSS       = Resident Set Size (kb)
  U: WCHAN     = Sleeping in Function
* V: STAT      = Process Status
* W: TIME      = CPU Time
* X: COMMAND   = Command
  Y: LC        = Last used CPU (expect this to change regularly)
  Z: FLAGS     = Task Flags (see linux/sched.h)

To show you how customizable top is, Listing 2.8 shows a highly configured output screen, which shows only the top options relevant to CPU usage.

Listing 2.8.

 08:16:23  up 2 days, 18:52,  4 users,  load average: 1.07, 0.92, 0.49
76 processes: 73 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   48.2%    0.0%    1.5%   0.0%     0.0%    0.0%   50.1%
           cpu00    0.3%    0.0%    0.1%   0.0%     0.0%    0.0%   99.5%
           cpu01   96.2%    0.0%    2.9%   0.0%     0.0%    0.0%    0.7%
Mem:  2037140k av, 1133548k used,  903592k free,       0k shrd,   86232k buff
       690812k active,             151536k inactive
Swap: 2040244k av,       0k used, 2040244k free                  322656k cached

  PID USER     PRI  NI WCHAN        FLAGS  LC STAT %CPU   TIME CPU COMMAND
 7642 root      25   0             100100   1 R    49.6  10:30   1 doom.x86
    1 root      15   0             400100   0 S     0.0   0:05   0 init
    2 root      RT   0                140   0 SW    0.0   0:00   0 migration/0
    3 root      RT   0                140   1 SW    0.0   0:00   1 migration/1
    4 root      15   0                 40   0 SW    0.0   0:00   0 keventd
    5 root      34  19                 40   0 SWN   0.0   0:00   0 ksoftirqd/0
    6 root      34  19                 40   1 SWN   0.0   0:00   1 ksoftirqd/1
    9 root      25   0                 40   0 SW    0.0   0:00   0 bdflush
    7 root      15   0                840   0 SW    0.0   0:00   0 kswapd
    8 root      15   0                 40   0 SW    0.0   0:00   0 kscand
   10 root      15   0                 40   0 SW    0.0   0:00   0 kupdated
   11 root      25   0                 40   0 SW    0.0   0:00   0 mdrecoveryd
   20 root      15   0             400040   0 SW    0.0   0:00   0 katad-1

top provides an overview of system resource usage with a focus on providing information about how various processes are consuming those resources. It is best used when interacting with the system directly because of the user-friendly and tool-unfriendly format of its output.

2.2.3. top (v. 3.x.x)

The version of top provided by the most recent distributions has been completely overhauled, and as a result, many of the command-line and interaction options have changed. Although the basic ideas are similar, it has been streamlined, and a few different display modes have been added.

Again, top presents a list, in decreasing order, of the top CPU-consuming processes.

2.2.3.1 CPU Performance-Related Options

top is invoked with the following command line:

top [-d delay] [-n iter] [-i] [-b]

top actually takes options in two modes: command-line options and runtime options. The command-line options determine how top displays its information. Table 2-8 shows the command-line options that influence the type and frequency of the performance statistics that top will display.

Table 2-8. `top` Command-Line Options

Option	Explanation
`-d delay`	Delay between statistic updates.
`-n iterations`	Number of iterations before exiting. `top` updates the statistics `iterations` times.
`-i`	This option changes whether or not idle processes display.
`-b`	Run in batch mode. Typically, `top` shows only a single screenful of information, and processes that don't fit on the screen never display. This option shows all the processes and can be very useful if you are saving `top` 's output to a file or piping the output to another command for processing.

As you run top, you may want to fine-tune what you are observing to investigate a particular problem. Like the 2.x version of top, the output of top is highly customizable. Table 2-9 describes options that change statistics shown during top's runtime.

Table 2-9. `top` Runtime Options

Option	Explanation
`A`	This displays an "alternate" display of process information that shows top consumers of various system resources.
`I`	This toggles whether `top` will divide the CPU usage by the number of CPUs on the system. For example, if a process was consuming all of both CPUs on a two-CPU system, this toggles whether `top` displays a CPU usage of 100% or 200%.
`f`	This displays a configuration screen that enables you to select which process statistics display on the screen.
`o`	This displays a configuration screen that enables you to change the order of the displayed statistics.

The options described in Table 2-10 turn on or off the display of various system-wide information. It can be helpful to turn off unneeded statistics to fit more processes on the screen.

Table 2-10. `top` Runtime Output Toggles

Option	Explanation
`1 (numeral 1)`	This toggles whether the CPU usage will be broken down to the individual usage or shown as a total.
`l`	This toggles whether the load average and uptime information will be updated and displayed.

top v3.x provides system-wide information in addition to information about specific processes, similar to that of top v2.x. These statistics are covered in Table 2-11.

Table 2-11. `top` Performance Statistics

Statistic	Explanation
`us`	CPU time spent in user applications.
`sy`	CPU time spent in the kernel.
`ni`	CPU time spent in "nice"ed processes.
`id`	CPU time spent idle.
`wa`	CPU time spent waiting for I/O.
`hi`	CPU time spent in the `irq` handlers.
`si`	CPU time spent in the `softirq` handlers.
`load average`	The 1-minute, 5-minute, and 15-minute load average.
`%CPU`	The percentage of CPU that a particular process is consuming.
`PR`	The priority value of the process, where a lower value indicates a higher priority. `RT` indicates that the task has real-time priority, a priority higher than the standard range.
`NI`	The nice value of the process. The higher the nice value, the less often the system has to execute the process. Processes with high nice values tend to have very low priorities.
`WCHAN`	If a process is waiting on an I/O, this shows which kernel function it is waiting in.
`TIME`	The total amount of CPU time (user and system) that this process has used since it started executing.
`COMMAND`	The command that this process is executing.
`S`	This is the current status of a process, where the process is either sleeping (`S`), running (`R`), zombied (killed but not yet dead) (`Z`), in an uninterruptible sleep (`D`), or being traced (`T`).

top provides a large amount of information about the different running processes and is a great way to figure out which process is a resource hog. Version 3 of top is a trimmed-down top that adds some alternative views of similar data.

2.2.3.2 Example Usage

Listing 2.9 is an example run of top v3.0. Again, it will periodically update the screen until you exit it. The statistics are similar to those of top v2.x, but are named slightly differently.

Listing 2.9.

catan> top
top - 08:52:21 up 19 days, 21:38, 17 users,  load average: 1.06, 1.13, 1.15
Tasks: 149 total,   1 running, 146 sleeping,   1 stopped,   1 zombie
Cpu(s):  0.8% us,  0.4% sy,  4.2% ni, 94.2% id,  0.1% wa,  0.0% hi,  0.3% si
Mem:   1034320k total,  1023188k used,    11132k free,    39920k buffers
Swap:  2040244k total,   214496k used,  1825748k free,   335488k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26364 root      16   0  400m  68m 321m S  3.8  6.8 379:32.04 X
26737 ezolt     15   0 71288  45m  21m S  1.9  4.5   6:32.04 gnome-terminal
29114 ezolt     15   0 34000  22m  18m S  1.9  2.2  27:57.62 gnome-system-mo
 9581 ezolt     15   0  2808 1028 1784 R  1.9  0.1   0:00.03 top
    1 root      16   0  2396  448 1316 S  0.0  0.0   0:01.68 init
    2 root      RT   0     0    0    0 S  0.0  0.0   0:00.68 migration/0
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/0
    4 root      RT   0     0    0    0 S  0.0  0.0   0:00.27 migration/1
    5 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/1
    6 root      RT   0     0    0    0 S  0.0  0.0   0:22.49 migration/2
    7 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/2
    8 root      RT   0     0    0    0 S  0.0  0.0   0:37.53 migration/3
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/3
   10 root       5 -10     0    0    0 S  0.0  0.0   0:01.74 events/0
   11 root       5 -10     0    0    0 S  0.0  0.0   0:02.77 events/1
   12 root       5 -10     0    0    0 S  0.0  0.0   0:01.79 events/2
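Because top can run in batch mode with the -b and -n options from Table 2-8, its summary lines can be saved and post-processed by a script. As a minimal sketch, the Python below pulls the CPU states out of a Cpu(s) line; the regular expression assumes exactly the "value% state" layout shown in Listing 2.9 and would need adjusting for versions of top that print the line differently. The function name is illustrative.

```python
import re


def parse_cpu_summary(line):
    """Return {state: percent} from a top v3 'Cpu(s):' summary line.

    Assumes each entry looks like '94.2% id', as in Listing 2.9.
    """
    return {
        state: float(percent)
        for percent, state in re.findall(r"([\d.]+)%\s*(\w+)", line)
    }


SUMMARY = "Cpu(s):  0.8% us,  0.4% sy,  4.2% ni, 94.2% id,  0.1% wa,  0.0% hi,  0.3% si"

if __name__ == "__main__":
    states = parse_cpu_summary(SUMMARY)
    print(states["id"])  # idle percentage
    print(round(sum(states.values()), 1))
```

For this line, the seven states add up to 100 percent, which is a quick sanity check that nothing was dropped while parsing.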

Now pressing f while top is running brings up the configuration screen shown in Listing 2.10. When you press the keys indicated (A for PID, B for PPID, etc.), top toggles whether these statistics display in the previous screen. When all the desired statistics are selected, press Enter to return to top's initial screen, which now shows the current values of selected statistics. When you are configuring the statistics, all currently activated fields are capitalized in the Current Fields line and have an asterisk (*) next to their name. Notice that most of the statistics are similar, but the names have changed slightly.

Listing 2.10.

(press 'f' while running)

Current Fields:  AEHIOQTWKNMbcdfgjplrsuvyzX  for window 1:Def
Toggle fields via field letter, type any other key to return

* A: PID        = Process Id              u: nFLT       = Page Fault count
* E: USER       = User Name               v: nDRT       = Dirty Pages count
* H: PR         = Priority                y: WCHAN      = Sleeping in Function
* I: NI         = Nice value              z: Flags      = Task Flags <sched.h>
* O: VIRT       = Virtual Image (kb)    * X: COMMAND    = Command name/line
* Q: RES        = Resident size (kb)
* T: SHR        = Shared Mem size (kb)  Flags field:
* W: S          = Process Status          0x00000001  PF_ALIGNWARN
* K: %CPU       = CPU usage               0x00000002  PF_STARTING
* N: %MEM       = Memory usage (RES)      0x00000004  PF_EXITING
* M: TIME+      = CPU Time, hundredths    0x00000040  PF_FORKNOEXEC
  b: PPID       = Parent Process Pid      0x00000100  PF_SUPERPRIV
  c: RUSER      = Real user name          0x00000200  PF_DUMPCORE
  d: UID        = User Id                 0x00000400  PF_SIGNALED
  f: GROUP      = Group Name              0x00000800  PF_MEMALLOC
  g: TTY        = Controlling Tty         0x00002000  PF_FREE_PAGES (2.5)
  j: #C         = Last used cpu (SMP)     0x00008000  debug flag (2.5)
  p: SWAP       = Swapped size (kb)       0x00024000  special threads (2.5)
  l: TIME       = CPU Time                0x001D0000  special states (2.5)
  r: CODE       = Code size (kb)          0x00100000  PF_USEDFPU (thru 2.4)
  s: DATA       = Data+Stack size (kb)

Listing 2.11 shows the new output mode of top, where many different statistics are sorted and displayed on the same screen.

Listing 2.11.

(press 'A' while running)

1:Def - 09:00:48 up 19 days, 21:46, 17 users,  load average: 1.01, 1.06, 1.10
Tasks: 144 total,   1 running, 141 sleeping,   1 stopped,   1 zombie
Cpu(s):  1.2% us,  0.9% sy,  0.0% ni, 97.9% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   1034320k total,  1024020k used,    10300k free,    39408k buffers
Swap:  2040244k total,   214496k used,  1825748k free,   335764k cached

1   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  29114 ezolt     16   0 34112  22m  18m S  3.6  2.2  28:15.06 gnome-system-mo
  26364 root      15   0  400m  68m 321m S  2.6  6.8 380:01.09 X
   9689 ezolt     16   0  3104 1092 1784 R  1.0  0.1   0:00.09 top
2   PID  PPID    TIME+  %CPU %MEM  PR  NI S  VIRT SWAP  RES  UID COMMAND
  30403 24989   0:00.03  0.0  0.1  15   0 S  5808 4356 1452 9336 bash
  29510 29505   7:19.59  0.0  5.9  16   0 S  125m  65m  59m 9336 firefox-bin
  29505 29488   0:00.00  0.0  0.1  16   0 S  5652 4576 1076 9336 run-mozilla.sh
3   PID %MEM  VIRT SWAP  RES CODE DATA  SHR nFLT nDRT S  PR  NI %CPU COMMAND
   8414 25.0  374m 121m 252m  496 373m  98m 1547    0 S  16   0  0.0 soffice.bin
  26364  6.8  400m 331m  68m 1696 398m 321m 2399    0 S  15   0  2.6 X
  29510  5.9  125m  65m  59m   64 125m  31m  253    0 S  16   0  0.0 firefox-bin
  26429  4.7 59760  10m  47m  404  57m  12m 1247    0 S  15   0  0.0 metacity
4   PID  PPID  UID USER     RUSER    TTY         TIME+  %CPU %MEM S COMMAND
   1371     1   43 xfs      xfs      ?          0:00.10  0.0  0.1 S xfs
   1313     1   51 smmsp    smmsp    ?          0:00.08  0.0  0.2 S sendmail
    982     1   29 rpcuser  rpcuser  ?          0:00.07  0.0  0.1 S rpc.statd
    963     1   32 rpc      rpc      ?          0:06.23  0.0  0.1 S portmap

top v3.x provides a slightly cleaner interface to top. It simplifies some aspects of it and provides a nice "summary" information screen that displays many of the resource consumers in the system.

2.2.4. procinfo (Display Info from the /proc File System)

Much like vmstat, procinfo provides a view of the system-wide performance characteristics. Although some of the information that it provides is similar to that of vmstat, it also provides information about the number of interrupts that the CPU received for each device. Its output format is a little more readable than that of vmstat, but it takes up much more screen space.

2.2.4.1 CPU Performance-Related Options

procinfo is invoked with the following command:

procinfo [-f] [-d] [-D] [-n sec] [-Ffile]

Table 2-12 describes the different options that change the output and the frequency of the samples that procinfo displays.

Table 2-12. `procinfo` Command-Line Options

Option	Explanation
`-f`	Runs `procinfo` in full-screen mode
`-d`	Displays the change in statistics between samples rather than totals
`-D`	Displays statistic totals rather than rate of change
`-n sec`	Number of seconds to pause between each sample
`-Ffile`	Sends the output of `procinfo` to a file

Table 2-13 shows the CPU statistics that procinfo gathers.

Table 2-13. `procinfo` CPU Statistics

Statistic	Explanation
`user`	This is the amount of user time that the CPU has spent in days, hours, and minutes.
`nice`	This is the amount of nice time that the CPU has spent in days, hours, and minutes.
`system`	This is the amount of system time that the CPU has spent in days, hours, and minutes.
`idle`	This is the amount of idle time that the CPU has spent in days, hours, and minutes.
`irq 0-N`	This displays the number of the `irq`, the number of times that it has fired, and which kernel driver is responsible for it.

Much like vmstat or top, procinfo is a low-overhead command that is good to leave running in a console or window on the screen. It gives a good indication of a system's health and performance.

2.2.4.2 Example Usage

Calling procinfo without any command options yields output similar to Listing 2.12. Without any options, procinfo displays only one screenful of status and then exits. procinfo is more useful when it is periodically updated using the -n sec option. This enables you to see how the system's performance is changing in real time.

Listing 2.12.

[ezolt@scrffy ~/mail]$ procinfo
Linux 2.4.18-3bigmem (bhcompile@daffy) (gcc 2.96 20000731 ) #1 4CPU [scrffy]

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:       1030784      987776       43008           0       35996      517504
Swap:      2040244       17480     2022764

Bootup: Thu Jun  3 09:20:22 2004    Load average: 0.47 0.32 0.26 1/118 10378

user  :       3:18:53.99   2.7%  page in :  1994292  disk 1:       20r       0w
nice  :       0:00:22.91   0.0%  page out:  2437543  disk 2:   247231r  131696w
system:       3:45:41.20   3.1%  swap in :      996
idle  :   4d 15:56:17.10  94.0%  swap out:     4374
uptime:   1d  5:45:18.80         context : 64608366

irq  0:  10711880 timer                 irq 12:   1319185 PS/2 Mouse
irq  1:     94931 keyboard              irq 14:   7144432 ide0
irq  2:         0 cascade [4]           irq 16:        16 aic7xxx
irq  3:         1                       irq 18:   4152504 nvidia
irq  4:         1                       irq 19:         0 usb-uhci
irq  6:         2                       irq 20:   4772275 es1371
irq  7:         1                       irq 22:    384919 aic7xxx
irq  8:         1 rtc                   irq 23:   3797246 usb-uhci, eth0

As you can see from Listing 2.12, procinfo provides a reasonable overview of the system. We can see once again, from the user, nice, system, and idle times, that the system is not very busy. One interesting thing to notice is that procinfo claims that the system has spent more time idle than the system has been running (as indicated by the uptime). This is because the system actually has four CPUs, so for every day of wall time, four days of CPU time pass. The load average confirms that the system has been relatively work-free for the recent past. For the past minute, on average, the system had less than one process ready to run; a load average of 0.47 indicates that a single process was ready to run only 47 percent of the time. On a four-CPU system, this large amount of CPU power is going to waste.
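The relationship between idle time, uptime, and the number of CPUs can be checked with a little arithmetic. The Python sketch below parses the duration strings as they appear in Listing 2.12 and computes the idle share of all available CPU time; the helper names are hypothetical, and the parser handles only the "Nd H:MM:SS.ss" and "H:MM:SS.ss" forms seen in that listing.

```python
def parse_duration(text):
    """Parse 'Nd H:MM:SS.ss' or 'H:MM:SS.ss' (as in Listing 2.12) to seconds."""
    parts = text.split()
    days = 0
    if parts[0].endswith("d"):
        days = int(parts[0][:-1])
        parts = parts[1:]
    hours, minutes, seconds = parts[0].split(":")
    return days * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def idle_percentage(idle, uptime, cpus):
    """Idle CPU time as a share of wall-clock uptime multiplied by CPU count."""
    return 100.0 * parse_duration(idle) / (parse_duration(uptime) * cpus)


if __name__ == "__main__":
    # Values from Listing 2.12; the machine has four CPUs.
    print(f"{idle_percentage('4d 15:56:17.10', '1d  5:45:18.80', 4):.1f}%")
```

With four CPUs, the roughly 4.7 days of idle time fit inside 1.24 days of uptime and work out to about 94 percent idle, in line with the percentage that procinfo prints next to the idle counter.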

procinfo also gives us a good view of what devices on the system are causing interrupts. We can see that the Nvidia card (nvidia), IDE controller (ide0), Ethernet device (eth0), and sound card (es1371) have a relatively high number of interrupts. This is as one would expect for a desktop workstation.

procinfo has the advantage of putting many of the system-wide performance statistics within a single screen, enabling you to see how the system is performing as a whole. It lacks details about network and disk performance, but it provides a good system-wide view of CPU and memory performance. One limitation that can be significant is that procinfo does not report when the CPU is in the iowait, irq, or softirq mode.

2.2.5. gnome-system-monitor

gnome-system-monitor is, in many ways, a graphical counterpart of top. It enables you to graphically monitor individual processes and observe the load on the system based on the graphs that it displays.

2.2.5.1 CPU Performance-Related Options

gnome-system-monitor can be invoked from the Gnome menu. (Under Red Hat 9 and greater, this is under System Tools > System Monitor.) However, it can also be invoked using the following command:

gnome-system-monitor

gnome-system-monitor has no relevant command-line options that affect the CPU performance measurements. However, some of the statistics shown can be modified by selecting gnome-system-monitor's Edit > Preferences menu entry.

2.2.5.2 Example Usage

When you launch gnome-system-monitor, it creates a window similar to Figure 2-1. This window shows information about the amount of CPU and memory that a particular process is using. It also shows the parent/child relationships between processes.

Figure 2-1.

Figure 2-2 shows a graphical view of system load and memory usage. This is really what distinguishes gnome-system-monitor from top. You can easily see the current state of the system and how it compares to the previous state.

Figure 2-2.

The graphical view of data provided by gnome-system-monitor can make it easier and faster to determine the state of the system, and how its behavior changes over time. It also makes it easier to navigate the system-wide process information.

2.2.6. mpstat (Multiprocessor Stat)

mpstat is a fairly simple command that shows you how your processors are behaving based on time. The biggest benefit of mpstat is that it shows the time next to the statistics, so you can look for a correlation between CPU usage and time of day.

If you have multiple CPUs or hyperthreading-enabled CPUs, mpstat can also break down CPU usage based on the processor, so you can see whether a particular processor is doing more work than the others. You can select which individual processor you want to monitor or you can ask mpstat to monitor all of them.

2.2.6.1 CPU Performance-Related Options

mpstat can be invoked using the following command line:

mpstat [ -P { cpu | ALL } ] [ delay [ count ] ]

Once again, delay specifies how often the samples will be taken, and count determines how many times it will be run. Table 2-14 describes the command-line options of mpstat.

Table 2-14. `mpstat` Command-Line Options

Option	Explanation
`-P { cpu \| ALL }`	This option tells `mpstat` which CPUs to monitor. `cpu` is a number between 0 and the total number of CPUs minus 1.
`delay`	This specifies how long `mpstat` waits between samples.

mpstat provides similar information to the other CPU performance tools, but it allows the information to be attributed to each of the processors in a particular system. Table 2-15 describes the CPU statistics that it reports.

Table 2-15. `mpstat` CPU Statistics

Statistic	Explanation
`user`	This is the percentage of user time that the CPU has spent during the previous sample.
`nice`	This is the percentage of time that the CPU has spent during the previous sample running low-priority (or nice) processes.
`system`	This is the percentage of system time that the CPU has spent during the previous sample.
`iowait`	This is the percentage of time that the CPU has spent during the previous sample waiting on I/O.
`irq`	This is the percentage of time that the CPU has spent during the previous sample handling interrupts.
`softirq`	This is the percentage of time that the CPU has spent during the previous sample handling work that needed to be done by the kernel after an interrupt has been handled.
`idle`	This is the percentage of time that the CPU has spent idle during the previous sample.

mpstat is a good tool for providing a breakdown of how each of the processors is performing. Because mpstat provides a per-CPU breakdown, you can identify whether one of the processors is becoming overloaded.

2.2.6.2 Example Usage

First, we ask mpstat to show us the CPU statistics for processor number 0. This is shown in Listing 2.13.

Listing 2.13.

[ezolt@scrffy sysstat-5.1.1]$ ./mpstat -P 0 1 10
Linux 2.6.8-1.521smp (scrffy)   10/20/2004

07:12:02 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle    intr/s
07:12:03 PM    0    9.80    0.00    1.96    0.98    0.00    0.00   87.25   1217.65
07:12:04 PM    0    1.01    0.00    0.00    0.00    0.00    0.00   98.99   1112.12
07:12:05 PM    0    0.99    0.00    0.00    0.00    0.00    0.00   99.01   1055.45
07:12:06 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1072.00
07:12:07 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1075.76
07:12:08 PM    0    1.00    0.00    0.00    0.00    0.00    0.00   99.00   1067.00
07:12:09 PM    0    4.90    0.00    3.92    0.00    0.00    0.98   90.20   1045.10
07:12:10 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1069.70
07:12:11 PM    0    0.99    0.00    0.99    0.00    0.00    0.00   98.02   1070.30
07:12:12 PM    0    3.00    0.00    4.00    0.00    0.00    0.00   93.00   1067.00
Average:       0    2.19    0.00    1.10    0.10    0.00    0.10   96.51   1085.34

Listing 2.14 shows a similar command run on a lightly loaded system with two hyperthreading-enabled CPUs. You can see how the stats for all the CPUs are shown. One interesting observation in this output is that one CPU handles the vast majority of the interrupts. If the system were heavily loaded with I/O, and all the interrupts were being handled by a single processor, this could be the cause of a bottleneck, because one CPU would be overwhelmed while the rest waited for work to do. You would be able to see this with mpstat if the processor handling all the interrupts had no idle time, whereas the other processors did.

Listing 2.14.

[ezolt@scrffy sysstat-5.1.1]$ ./mpstat -P ALL 1 2
Linux 2.6.8-1.521smp (scrffy)   10/20/2004

07:13:21 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle    intr/s
07:13:22 PM  all    3.98    0.00    1.00    0.25    0.00    0.00   94.78   1322.00
07:13:22 PM    0    2.00    0.00    0.00    1.00    0.00    0.00   97.00   1137.00
07:13:22 PM    1    6.00    0.00    2.00    0.00    0.00    0.00   93.00    185.00
07:13:22 PM    2    1.00    0.00    0.00    0.00    0.00    0.00   99.00      0.00
07:13:22 PM    3    8.00    0.00    1.00    0.00    0.00    0.00   91.00      0.00

07:13:22 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle    intr/s
07:13:23 PM  all    2.00    0.00    0.50    0.00    0.00    0.00   97.50   1352.53
07:13:23 PM    0    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1135.35
07:13:23 PM    1    6.06    0.00    2.02    0.00    0.00    0.00   92.93    193.94
07:13:23 PM    2    0.00    0.00    0.00    0.00    0.00    0.00  101.01     16.16
07:13:23 PM    3    1.01    0.00    1.01    0.00    0.00    0.00  100.00      7.07

Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft   %idle    intr/s
Average:     all    2.99    0.00    0.75    0.12    0.00    0.00   96.13   1337.19
Average:       0    1.01    0.00    0.00    0.50    0.00    0.00   98.49   1136.18
Average:       1    6.03    0.00    2.01    0.00    0.00    0.00   92.96    189.45
Average:       2    0.50    0.00    0.00    0.00    0.00    0.00  100.00      8.04
Average:       3    4.52    0.00    1.01    0.00    0.00    0.00   95.48      3.52

mpstat can be used to determine whether the CPUs are fully utilized and relatively balanced. By observing the number of interrupts each CPU is handling, it is possible to find an imbalance. Details on how to control where interrupts are routed are provided in the kernel source under Documentation/IRQ-affinity.txt.
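One way to quantify such an imbalance is to express each CPU's interrupt rate as a share of the total. The following is a minimal Python sketch that uses the per-CPU averages from Listing 2.14; the function name interrupt_shares is hypothetical and is not part of mpstat.

```python
# Average intr/s per CPU, copied from the "Average:" rows of Listing 2.14.
avg_intr_per_sec = {0: 1136.18, 1: 189.45, 2: 8.04, 3: 3.52}

def interrupt_shares(rates):
    """Return each CPU's fraction of the total interrupt rate."""
    total = sum(rates.values())
    return {cpu: rate / total for cpu, rate in rates.items()}

shares = interrupt_shares(avg_intr_per_sec)
busiest = max(shares, key=shares.get)
for cpu, share in sorted(shares.items()):
    print(f"CPU {cpu}: {share:6.1%} of interrupts")
print(f"CPU {busiest} handles the largest share")
```

For this sample, CPU 0 ends up with well over 80 percent of the interrupts, which is the kind of skew that is worth investigating on a system under heavy I/O load.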

2.2.7. sar (System Activity Reporter)

sar has yet another approach to collecting system data. sar can efficiently record system performance data collected into binary files that can be replayed at a later date. sar is a low-overhead way to record information about how the system is performing.

The sar command can be used to record performance information, replay previously recorded information, and display real-time information about the current system. The output of the sar command can be formatted to make it easy to pipe to relational databases or to other Linux commands for processing.

2.2.7.1 CPU Performance-Related Options

sar can be invoked with the following command line:

sar [options] [ delay [ count ] ]

Although sar reports about many different areas of Linux, the statistics are of two different forms. One set of statistics is the instantaneous value at the time of the sample. The other is a rate since the last sample. Table 2-16 describes the command-line options of sar .

Table 2-16. `sar` Command-Line Options

Option	Explanation
`-c`	This reports information about how many processes are being created per second.
`-I { irq \| SUM \| ALL \| XALL }`	This reports the rates at which interrupts have been occurring in the system.
`-P { cpu \| ALL }`	This option specifies which CPU the statistics should be gathered from. If this isn't specified, the system totals are reported.
`-q`	This reports information about the run queues and load averages of the machine.
`-u`	This reports information about CPU utilization of the system. (This is the default output.)
`-w`	This reports the number of context switches that occurred in the system.
`-o filename`	This specifies the name of the binary output file that will store the performance statistics.
`-f filename`	This specifies the name of the binary file from which previously recorded performance statistics are read.
`delay`	The amount of time to wait between samples.
`count`	The total number of samples to record.
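The difference between the two forms of statistics mentioned before Table 2-16 can be illustrated with a short calculation. This Python sketch assumes a counter that only ever increases (such as a running total of context switches); the two readings are hypothetical values chosen for illustration and were not produced by sar.

```python
def rate_per_second(previous, current, interval_seconds):
    """Convert two readings of an ever-increasing counter into a per-second rate."""
    return (current - previous) / interval_seconds

# Hypothetical readings of a cumulative context-switch counter, taken 1 second apart.
first_reading = 64_608_366
second_reading = 64_608_960

print(rate_per_second(first_reading, second_reading, 1.0))
```

An instantaneous statistic, such as the size of the run queue, needs no such calculation; it is simply the value observed when the sample is taken.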

sar offers a similar set (with different names) of the system-wide CPU performance statistics that we have seen in the preceding tools. The list is shown in Table 2-17.

Table 2-17. `sar` CPU Statistics

Statistic	Explanation
`user`	This is the percentage of user time that the CPU has spent during the previous sample.
`nice`	This is the percentage of time that the CPU has spent during the previous sample running low-priority (or nice) processes.
`system`	This is the percentage of system time that the CPU has spent during the previous sample.
`iowait`	This is the percentage of time that the CPU has spent during the previous sample waiting on I/O.
`idle`	This is the percentage of time that the CPU was idle during the previous sample.
`runq-sz`	This is the size of the run queue when the sample was taken.
`plist-sz`	This is the number of processes present (running, sleeping, or waiting for I/O) when the sample was taken.
`ldavg-1`	This is the load average for the past minute.
`ldavg-5`	This is the load average for the past 5 minutes.
`ldavg-15`	This is the load average for the past 15 minutes.
`proc/s`	This is the number of new processes created per second. (This is the same as the `forks` statistic from `vmstat`.)
`cswch/s`	This is the number of context switches per second.
`intr/s`	The number of interrupts fired per second.

One of the significant benefits of sar is that it enables you to save many different types of time-stamped system data to log files for later retrieval and review. This can prove very handy when trying to figure out why a particular machine is failing at a particular time.

2.2.7.2 Example Usage

The first command, shown in Listing 2.15, takes three samples of the CPU statistics at one-second intervals and stores the results in the binary file /tmp/apache_test. This command does not have any visual output and just returns when it has completed.

Listing 2.15.

[ezolt@wintermute sysstat-5.0.2]$ sar -o /tmp/apache_test 1 3

After the information has been stored in the /tmp/apache_test file, we can display it in various formats. The default is human readable. This is shown in Listing 2.16. This shows similar information to the other system monitoring commands, where we can see how the processor was spending time at a particular time.

Listing 2.16.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test
Linux 2.4.22-1.2149.nptl (wintermute.phil.org)  03/20/04

17:18:34          CPU     %user     %nice   %system   %iowait     %idle
17:18:35          all     90.00      0.00     10.00      0.00      0.00
17:18:36          all     95.00      0.00      5.00      0.00      0.00
17:18:37          all     92.00      0.00      6.00      0.00      2.00
Average:          all     92.33      0.00      7.00      0.00      0.67

However, sar can also output the statistics in a format that can be easily imported into a relational database, as shown in Listing 2.17. This can be useful for storing a large amount of performance data. Once it has been imported into a relational database, the performance data can be analyzed with all of the tools of a relational database.

Listing 2.17.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test -H
wintermute.phil.org;1;2004-03-20 22:18:35 UTC;-1;90.00;0.00;10.00;0.00;0.00
wintermute.phil.org;1;2004-03-20 22:18:36 UTC;-1;95.00;0.00;5.00;0.00;0.00
wintermute.phil.org;1;2004-03-20 22:18:37 UTC;-1;92.00;0.00;6.00;0.00;2.00
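As a rough illustration of the database import, the following Python sketch loads the three records from Listing 2.17 into an in-memory SQLite database and runs a query against them. The table name and column names are assumptions inferred from the column order of Listing 2.16 (the semicolon-separated output shown here carries no header), and SQLite merely stands in for whatever relational database you actually use.

```python
import csv
import io
import sqlite3

# The three records from Listing 2.17 (semicolon-separated output of sar -H).
SAR_OUTPUT = """\
wintermute.phil.org;1;2004-03-20 22:18:35 UTC;-1;90.00;0.00;10.00;0.00;0.00
wintermute.phil.org;1;2004-03-20 22:18:36 UTC;-1;95.00;0.00;5.00;0.00;0.00
wintermute.phil.org;1;2004-03-20 22:18:37 UTC;-1;92.00;0.00;6.00;0.00;2.00
"""

# Column names are assumed; the field meanings are inferred from Listing 2.16.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cpu_stats (host TEXT, interval_s INTEGER, sample_time TEXT, "
    "cpu INTEGER, pct_user REAL, pct_nice REAL, pct_system REAL, "
    "pct_iowait REAL, pct_idle REAL)"
)
rows = csv.reader(io.StringIO(SAR_OUTPUT), delimiter=";")
conn.executemany("INSERT INTO cpu_stats VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

avg_user, max_system = conn.execute(
    "SELECT AVG(pct_user), MAX(pct_system) FROM cpu_stats"
).fetchone()
print(f"average %user: {avg_user:.2f}, peak %system: {max_system:.2f}")
```

The average user time computed by the query agrees with the 92.33 that sar itself reports on the Average line of Listing 2.16.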

Finally, sar can also output the statistics in a format that can be easily parsed by standard Linux tools such as awk, perl, python, or grep. This output, which is shown in Listing 2.18, can be fed into a script that will pull out interesting events, and possibly even analyze different trends in the data.

Listing 2.18.

[ezolt@wintermute sysstat-5.0.2]$ sar -f /tmp/apache_test -h
wintermute.phil.org     1       1079821115      all     %user   90.00
wintermute.phil.org     1       1079821115      all     %nice   0.00
wintermute.phil.org     1       1079821115      all     %system 10.00
wintermute.phil.org     1       1079821115      all     %iowait 0.00
wintermute.phil.org     1       1079821115      all     %idle   0.00
wintermute.phil.org     1       1079821116      all     %user   95.00
wintermute.phil.org     1       1079821116      all     %nice   0.00
wintermute.phil.org     1       1079821116      all     %system 5.00
wintermute.phil.org     1       1079821116      all     %iowait 0.00
wintermute.phil.org     1       1079821116      all     %idle   0.00
wintermute.phil.org     1       1079821117      all     %user   92.00
wintermute.phil.org     1       1079821117      all     %nice   0.00
wintermute.phil.org     1       1079821117      all     %system 6.00
wintermute.phil.org     1       1079821117      all     %iowait 0.00
wintermute.phil.org     1       1079821117      all     %idle   2.00
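A script that pulls interesting events out of this format could look something like the following Python sketch. It assumes that every line has exactly the six whitespace-separated fields seen in Listing 2.18; the function name parse_sar_h and the choice of "no idle time" as the interesting event are illustrative only.

```python
from collections import defaultdict

# The output of sar -h from Listing 2.18.
SAR_OUTPUT = """\
wintermute.phil.org 1 1079821115 all %user 90.00
wintermute.phil.org 1 1079821115 all %nice 0.00
wintermute.phil.org 1 1079821115 all %system 10.00
wintermute.phil.org 1 1079821115 all %iowait 0.00
wintermute.phil.org 1 1079821115 all %idle 0.00
wintermute.phil.org 1 1079821116 all %user 95.00
wintermute.phil.org 1 1079821116 all %nice 0.00
wintermute.phil.org 1 1079821116 all %system 5.00
wintermute.phil.org 1 1079821116 all %iowait 0.00
wintermute.phil.org 1 1079821116 all %idle 0.00
wintermute.phil.org 1 1079821117 all %user 92.00
wintermute.phil.org 1 1079821117 all %nice 0.00
wintermute.phil.org 1 1079821117 all %system 6.00
wintermute.phil.org 1 1079821117 all %iowait 0.00
wintermute.phil.org 1 1079821117 all %idle 2.00
"""

def parse_sar_h(text):
    """Group the one-statistic-per-line records into one dict per timestamp."""
    samples = defaultdict(dict)
    for line in text.splitlines():
        host, interval, timestamp, cpu, name, value = line.split()
        samples[int(timestamp)][name] = float(value)
    return dict(samples)

samples = parse_sar_h(SAR_OUTPUT)
# Flag the samples in which the CPU had no idle time at all.
busy = [ts for ts, stats in sorted(samples.items()) if stats["%idle"] == 0.0]
print("timestamps with 0% idle:", busy)
```

Because each line carries a single statistic, grouping by timestamp first makes it easy to compare several statistics from the same sample.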

In addition to recording information in a file, sar can also be used to observe a system in real time. In the example shown in Listing 2.19, the CPU state is sampled three times with one second between them.

Listing 2.19.

[ezolt@wintermute sysstat-5.0.2]$ sar 1 3
Linux 2.4.22-1.2149.nptl (wintermute.phil.org)  03/20/04

17:27:10          CPU     %user     %nice   %system   %iowait     %idle
17:27:11          all     96.00      0.00      4.00      0.00      0.00
17:27:12          all     98.00      0.00      2.00      0.00      0.00
17:27:13          all     92.00      0.00      8.00      0.00      0.00
Average:          all     95.33      0.00      4.67      0.00      0.00

The default display's purpose is to show information about the CPU, but other information can also be displayed. For example, sar can show the number of context switches per second, and the number of memory pages that have been swapped in or out. In Listing 2.20, sar samples the information two times, with one second between them. In this case, we ask sar to show us the number of context switches and process creations that occur every second. We also ask sar for information about the load average. We can see in this example that this machine has 163 processes present, none of which were in the run queue when the samples were taken. For the past minute, on average, 1.12 processes have been ready to run.

Listing 2.20.

[ezolt@scrffy manuscript]$ sar -w -c -q 1 2
Linux 2.6.8-1.521smp (scrffy)   10/20/2004

08:23:29 PM    proc/s
08:23:30 PM      0.00

08:23:29 PM   cswch/s
08:23:30 PM    594.00

08:23:29 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
08:23:30 PM         0       163      1.12      1.17      1.17

08:23:30 PM    proc/s
08:23:31 PM      0.00

08:23:30 PM   cswch/s
08:23:31 PM    812.87

08:23:30 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
08:23:31 PM         0       163      1.12      1.17      1.17

Average:       proc/s
Average:         0.00

Average:      cswch/s
Average:       703.98

Average:      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
Average:            0       163      1.12      1.17      1.17

As you can see, sar is a powerful tool that can record many different performance statistics. It provides a Linux-friendly interface that enables you to easily extract and analyze the performance data.

2.2.8. oprofile

oprofile is a performance suite that uses the performance counter hardware available in nearly all modern processors to track where CPU time is being spent, both across an entire system and within individual processes. In addition to measuring where CPU cycles are spent, oprofile can measure very low-level information about how the CPU is performing. Depending on the events supported by the underlying processor, it can measure such things as cache misses, branch mispredictions, memory references, and floating-point operations.

oprofile does not record every event that occurs; instead, it works with the processor's performance hardware to sample every count events, where count is a value that users specify when they start oprofile. The lower the value of count, the more accurate the results are, but the higher the overhead of oprofile. By keeping count to a reasonable value, oprofile can run with a very low overhead but still give an amazingly accurate account of the performance of the system.
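The trade-off between count and accuracy can be illustrated with a small simulation. The following Python sketch is not how oprofile itself is implemented; it simply keeps every count-th entry of a hypothetical event stream in which each event is tagged with the routine that caused it, and then estimates how the events are distributed from the samples alone.

```python
import random
from collections import Counter

def sample_every(events, count):
    """Keep every count-th event, mimicking a counter that overflows after count events."""
    return events[count - 1::count]

def estimated_shares(samples):
    """Fraction of the samples attributed to each routine."""
    totals = Counter(samples)
    return {name: hits / len(samples) for name, hits in totals.items()}

# Hypothetical workload: 70% of events come from "hot", 20% from "warm", 10% from "cold".
rng = random.Random(0)
events = rng.choices(["hot", "warm", "cold"], weights=[70, 20, 10], k=200_000)

for count in (100, 10_000):
    samples = sample_every(events, count)
    shares = estimated_shares(samples)
    print(f"count={count:>6}: {len(samples):>5} samples, "
          f"hot share ~ {shares.get('hot', 0.0):.3f}")
```

With the smaller count, the profile is built from 2,000 samples instead of 20, so the estimated share for the hot routine is much more likely to land close to the true 70 percent, at the price of taking a sample 100 times as often.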

Sampling is very powerful, but watch out for some nonobvious gotchas when using it. First, sampling may say that you are spending 90 percent of your time in a particular routine, but it does not say why. There can be two possible causes for a high number of cycles attributed to a particular routine. First, it is possible that this routine is the bottleneck and is taking a long time to execute. However, it may also be that the function is taking a reasonable amount of time to execute, but is called a large number of times. You can usually figure out which is the case by looking at the samples around the particularly hot line, or by instrumenting the code to count the number of calls that are made to it.
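Instrumenting code to count calls can be as simple as wrapping the suspect routine. The following Python sketch shows the general idea; the decorator count_calls and the routine checksum are hypothetical names used only for illustration, and the same approach applies in any language by incrementing a counter at the top of the function.

```python
import functools
import time

def count_calls(func):
    """Wrap func so that it records how often it runs and for how long in total."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            wrapper.calls += 1
            wrapper.total_seconds += time.perf_counter() - start
    wrapper.calls = 0
    wrapper.total_seconds = 0.0
    return wrapper

@count_calls
def checksum(data):  # hypothetical routine standing in for the "hot" function
    return sum(data) % 255

for _ in range(1000):
    checksum(range(100))

per_call = checksum.total_seconds / checksum.calls
print(f"{checksum.calls} calls, {per_call * 1e6:.1f} microseconds per call")
```

A high call count combined with a small time per call points to the second cause described above: the routine itself is reasonable, but it is called very often.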

The second problem of sampling is that you are never quite sure where a function is being called from. Even if you figure out that the function is being called many times and you track down all of the functions that call it, it is not necessarily clear which function is doing the majority of the calling.

2.2.8.1 CPU Performance-Related Options

oprofile is actually a suite of pieces that work together to collect CPU performance statistics. There are three main pieces of oprofile :

The oprofile kernel module manipulates the processor and turns on and off sampling.
The oprofile daemon collects the samples and saves them to disk.
The oprofile reporting tools take the collected samples and show the user how they relate to the applications running on the system.

The oprofile suite hides the driver and daemon manipulation in the opcontrol command. The opcontrol command is used to select which events the processor will sample and start the sampling.

When controlling the daemon, you can invoke opcontrol using the following command line:

opcontrol [--start] [--stop] [--dump]

These options control the profiling daemon, enabling you to start and stop sampling and to dump the samples from the daemon's memory to disk. When sampling, the oprofile daemon stores a large number of samples in internal buffers. However, it is only possible to analyze the samples that have been written (or dumped) to disk. Writing to disk can be an expensive operation, so oprofile only does it periodically. As a result, after running a test and profiling with oprofile, the results may not be available immediately, and you will have to wait until the daemon flushes the buffers to disk. This can be very annoying when you want to begin analysis immediately, so the opcontrol command enables you to force the dump of samples from the oprofile daemon's internal buffers to disk. This enables you to begin a performance investigation immediately after a test has completed.

Table 2-18 describes the command-line options for the opcontrol program that enable you to control the operation of the daemon.

Table 2-18. `opcontrol` Daemon Control

Option	Explanation
`-s/--start`	Starts profiling. Unless an event has been specified, this uses a default event for the current processor.
`-d/--dump`	Dumps the sampling information that is currently in the kernel sample buffers to the disk.
`--stop`	This will stop the profiling.

By default, oprofile picks an event with a given frequency that is reasonable for the processor and kernel that you are running it on. However, it has many more events that can be monitored than the default. When you are listing and selecting an event, opcontrol is invoked using the following command line:

opcontrol [--list-events] [--event=name:count:unitmask:kernel:user]

The event specifier enables you to select which event is going to be sampled; how frequently it will be sampled; and whether that sampling will take place in kernel space, user space, or both. Table 2-19 describes the command-line option of opcontrol that enables you to select different events to sample.

Table 2-19. `opcontrol` Event Handling

Option	Explanation
`-l/--list-events`	Lists the different events that the processor can sample.
`--event=name:count:unitmask:kernel:user`	Used to specify what events will be sampled. The event name must be one of the events that the processor supports. A valid event can be retrieved from the `--list-events` option. The `count` parameter specifies that the processor will be sampled every `count` times that event happens. The `unitmask` modifies what the event is going to sample. For example, if you are sampling "reads from memory," the unit mask may allow you to select only those reads that didn't hit in the cache. The `kernel` parameter specifies whether `oprofile` should sample when the processor is running in kernel space. The `user` parameter specifies whether `oprofile` should sample when the processor is running in user space.
`--vmlinux=kernel`	Specifies which uncompressed kernel image `oprofile` will use to attribute samples to various kernel functions.
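To make the fields of an event specifier concrete, the following Python sketch splits one into its five parts. It uses the default event string that opcontrol prints later in Listing 2.22; the EventSpec type and parse_event_spec function are hypothetical helpers and are not part of oprofile.

```python
from typing import NamedTuple

class EventSpec(NamedTuple):
    name: str       # event name, as listed by --list-events
    count: int      # take a sample every `count` occurrences of the event
    unitmask: int   # refines what the event counts
    kernel: bool    # sample while running in kernel space
    user: bool      # sample while running in user space

def parse_event_spec(spec):
    """Split a name:count:unitmask:kernel:user string into typed fields."""
    name, count, unitmask, kernel, user = spec.strip(":").split(":")
    return EventSpec(name, int(count), int(unitmask, 0), kernel == "1", user == "1")

# The default event that opcontrol reports in Listing 2.22.
spec = parse_event_spec("CPU_CLK_UNHALTED:233869:0:1:1")
print(spec)
```

Read this way, the default event samples unhalted clock cycles every 233,869 events, with no unit mask, in both kernel space and user space.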

After the samples have been collected and saved to disk, oprofile provides a different tool, opreport , which enables you to view the samples that have been collected. opreport is invoked using the following command line:

opreport [-r] [-t]

Typically, opreport displays all the samples collected by the system and which executables (including the kernel) are responsible for them. The executables with the highest number of samples are shown first, and are followed by all the executables with samples. In a typical system, most of the samples are in a handful of executables at the top of the list, with a very large number of executables contributing a very small number of samples. To deal with this, opreport enables you to set a threshold, and only executables with that percentage of the total samples or greater will be shown. Alternatively, opreport can reverse the order of the executables that are shown, so those with a high contribution are shown last. This way, the most important data is printed last, and it will not scroll off the screen.

Table 2-20 describes these command-line options of opreport that enable you to format the output of the sampling.

Table 2-20. `opreport` Report Format

Option	Explanation
`--reverse-sort / -r`	Reverses the order of the sort. Typically, the images that caused the most events display first.
`--threshold / -t [percentage]`	Causes `opreport` to only show images that have contributed `percentage` or more of the samples. This can be useful when there are many images with a very small number of samples and you are only interested in the most significant.
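The effect of these two options can be sketched in a few lines of Python. The rows below are copied from the opreport output that appears later in Listing 2.25; the report function is a hypothetical stand-in that only mimics the filtering and ordering, and does not reproduce opreport's actual implementation.

```python
# (samples, percent, image) rows copied from the opreport output in Listing 2.25.
ROWS = [
    (3190, 50.4507, "vmlinux-2.4.22-1.2174.nptlsmp"),
    (905, 14.3128, "emacs"),
    (749, 11.8456, "libc-2.3.2.so"),
    (261, 4.1278, "ld-2.3.2.so"),
    (244, 3.8589, "mpg321"),
    (233, 3.6850, "insmod"),
    (171, 2.7044, "libperl.so"),
    (128, 2.0244, "bash"),
    (113, 1.7871, "ext3.o"),
]

def report(rows, threshold=0.0, reverse=False):
    """Mimic the effect of a threshold and a reversed sort on an already sorted list."""
    kept = [row for row in rows if row[1] >= threshold]
    return kept[::-1] if reverse else kept

for samples, percent, image in report(ROWS, threshold=10.0, reverse=True):
    print(f"{samples:>8} {percent:8.4f} {image}")
```

With a threshold of 10 percent and a reversed sort, only three images remain, and the kernel image, which is the largest contributor, is printed last.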

Again, oprofile is a complicated tool, and these options show only the basics of what oprofile can do. You learn more about the capabilities of oprofile in later chapters.

2.2.8.2 Example Usage

oprofile is a very powerful tool, but it can also be difficult to install. Appendix B, "Installing oprofile," contains instructions on how to get oprofile installed and running on a few of the major Linux distributions.

We begin the use of oprofile by setting it up for profiling. This first command, shown in Listing 2.21, uses the opcontrol command to tell the oprofile suite where an uncompressed image of the kernel is located. oprofile needs to know the location of this file so that it can attribute samples to exact functions within the kernel.

Listing 2.21.

[root@wintermute root]# opcontrol --vmlinux=/boot/vmlinux-\
2.4.22-1.2174.nptlsmp

After we set up the path to the current kernel, we can begin profiling. The command in Listing 2.22 tells oprofile to start sampling using the default event. This event varies depending on the processor, but the default event for this processor is CPU_CLK_UNHALTED . This event samples all of the CPU cycles where the processor is not halted. The 233869 means that the processor will sample the instruction the processor is executing every 233,869 events.

Listing 2.22.

[root@wintermute root]# opcontrol -s
Using default event: CPU_CLK_UNHALTED:233869:0:1:1
Using log file /var/lib/oprofile/oprofiled.log
Daemon started.
Profiler running.
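To get a feel for what a count of 233,869 means in practice, it can be combined with the processor's clock speed. The following Python sketch assumes the 467.739MHz estimate that opreport prints for this machine in Listing 2.25, and it assumes that the CPU_CLK_UNHALTED event fires once per clock cycle while the processor is not halted.

```python
CLOCK_HZ = 467.739e6   # estimated clock speed reported by opreport in Listing 2.25
COUNT = 233869         # sample interval from the default event in Listing 2.22

# One sample is taken for every COUNT unhalted clock cycles.
samples_per_second = CLOCK_HZ / COUNT
print(f"about {samples_per_second:.0f} samples per second while the CPU is not halted")
```

Under these assumptions, the default works out to roughly 2,000 samples for every second that the processor spends executing instructions rather than halted.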

Now that we have started sampling, we want to begin to analyze the sampling results. In Listing 2.23, we start to use the reporting tools to figure out what is happening in the system. opreport reports what has been profiled so far.

Listing 2.23.

[root@wintermute root]# opreport
opreport op_fatal_error: No sample file found: try running opcontrol --dump
or specify a session containing sample files

Uh oh! Even though the profiling has been happening for a little while, we are stopped when opreport specifies that it cannot find any samples. This is because the opreport command is looking for the samples on disk, but the oprofile daemon stores the samples in memory and only periodically dumps them to disk. When we ask opreport for a list of the samples, it does not find any on disk and reports that it cannot find any samples. To alleviate this problem, we can force the daemon to flush the samples immediately by issuing a dump option to opcontrol , as shown in Listing 2.24. This command enables us to view the samples that have been collected.

Listing 2.24.

[root@wintermute root]# opcontrol --dump

After we dump the samples to disk, we try again, and ask oprofile for the report, as shown in Listing 2.25. This time, we have results. The report contains information about the processor that it was collected on and the types of events that were monitored. The report then lists, in descending order, the number of events that occurred and which executable they occurred in. We can see that the Linux kernel is taking up 50 percent of the total cycles, emacs is taking 14 percent, and libc is taking 12 percent. It is possible to dig deeper into an executable and determine which function is taking up all the time, but that is covered in Chapter 4, "Performance Tools: Process-Specific CPU."

Listing 2.25.

[root@wintermute root]# opreport
CPU: PIII, speed 467.739 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 233869
     3190 50.4507 vmlinux-2.4.22-1.2174.nptlsmp
      905 14.3128 emacs
      749 11.8456 libc-2.3.2.so
      261  4.1278 ld-2.3.2.so
      244  3.8589 mpg321
      233  3.6850 insmod
      171  2.7044 libperl.so
      128  2.0244 bash
      113  1.7871 ext3.o
....

When we started oprofile, we just used the default event that opcontrol chose for us. Each processor has a very rich set of events that can be monitored. In Listing 2.26, we ask opcontrol to list all the events that are available for this particular CPU. This list is quite long, but in this case, we can see that in addition to CPU_CLK_UNHALTED, we can also monitor DATA_MEM_REFS and DCU_LINES_IN. These are events caused by the memory subsystem, and we investigate them in later chapters.

Listing 2.26.

[root@wintermute root]# opcontrol -l
oprofile: available events for CPU type "PIII"

See Intel Architecture Developer's Manual Volume 3, Appendix A and Intel Architecture Optimization Reference Manual (730795-001)

CPU_CLK_UNHALTED: (counter: 0, 1)
        clocks processor is not halted (min count: 6000)
DATA_MEM_REFS: (counter: 0, 1)
        all memory references, cachable and non (min count: 500)
DCU_LINES_IN: (counter: 0, 1)
        total lines allocated in the DCU (min count: 500)
....

The command needed to specify which events we will monitor can be cumbersome, so fortunately, we can also use oprofile's graphical oprof_start command to start and stop sampling. This enables us to select the events that we want graphically, without the need to figure out exactly how to specify them on the command line.

In the example of oprof_start shown in Figure 2-3, we tell oprofile that we want to monitor DATA_MEM_REFS and L2_LD events at the same time. The DATA_MEM_REFS event can tell us which applications use the memory subsystem a lot, and the L2_LD event can tell us which use the level 2 cache. In this particular processor, the hardware has only two counters that can be used for sampling, so only two events can be used simultaneously.

Figure 2-3.

Now that we have gathered the samples using the graphical interface to oprofile, we can analyze the data that it has collected. In Listing 2.27, we ask opreport to display the profile of samples that it has collected, in the same way that we did when we were monitoring cycles. In this case, we can see that the libmad library has 31 percent of the data memory references of the whole system and appears to be the heaviest user of the memory subsystem.

Listing 2.27.

[root@wintermute root]# opreport
CPU: PIII, speed 467.739 MHz (estimated)
Counted DATA_MEM_REFS events (all memory references, cachable and non) with a unit mask of 0x00 (No unit mask) count 30000
Counted L2_LD events (number of L2 data loads) with a unit mask of 0x0f (All cache states) count 233869
    87462 31.7907       17  3.8636 libmad.so.0.1.0
    24259  8.8177       10  2.2727 mpg321
    23735  8.6272       40  9.0909 libz.so.1.2.0.7
    17513  6.3656       56 12.7273 libgklayout.so
    17128  6.2257       90 20.4545 vmlinux-2.4.22-1.2174.nptlsmp
    13471  4.8964        4  0.9091 libpng12.so.0.1.2.2
    12868  4.6773       20  4.5455 libc-2.3.2.so
....

The output provided by opreport displays all the system libraries and executables that contain any of the events that we were sampling. Note that not all the events have been recorded; because we are sampling, only a subset of the events is actually recorded. This is usually not a problem, because if a particular library or executable is a performance problem, it will likely cause high-cost events to happen many times. If the sampling is random, these high-cost events will eventually be caught by the sampling code.

Thursday, July 28, 2011
