A program that does not perform acceptably is not functional.
Every program has to satisfy a set of users--admittedly, sometimes a large and diverse set. If the performance of the program is truly unacceptable to a significant group of those users, it will not be used. A program that is not being used is not performing its intended function.
This is true of licensed software packages as well as user-written applications, although most developers of software packages are aware of the effects of poor performance and take pains to make their programs run as fast as possible. Unfortunately, they can't anticipate all of the environments and uses that their programs will experience. Final responsibility for acceptable performance falls on the people who select or write, plan for, and install software packages.
This chapter attempts to describe the stages by which a programmer or system administrator can ensure that a newly written or purchased program has acceptable performance. (Wherever the word programmer appears alone, the term includes system administrators and anyone else who is responsible for the ultimate success of a program.)
The way to achieve acceptable performance in a program is to identify and quantify acceptability at the start of the project and never lose sight of the measures and resources needed to achieve it. This prescription borders on banal, but some programming projects consciously reject it. They adopt a policy that might be fairly described as "design, code, debug, maybe document, and if we have time, fix up the performance."
The only way that programs can predictably be made to function in time, not just in logic, is by integrating performance considerations in the software planning and development process. Advance planning is perhaps more critical when existing software is being installed, because the installer has fewer degrees of freedom than the developer.
Although the detail of this process may seem burdensome for a small program, remember that we have a second agenda. Not only must the new program have satisfactory performance; we must also ensure that the addition of that program to an existing system does not cause the performance of other programs run on that system to become unsatisfactory.
Whether the program is new or purchased, small or large, the developers, the installers, and the prospective users all have assumptions about how the program will be used.
Unless these ideas are elicited as part of the design process, they will probably be vague, and the programmers will almost certainly have different assumptions than the prospective users. Even in the apparently trivial case in which the programmer is also the user, leaving the assumptions unarticulated makes it impossible to compare design to assumptions in any rigorous way. Worse, it is impossible to identify performance requirements without a complete understanding of the work being performed.
In identifying and quantifying performance requirements, it is important to identify the reasoning behind a particular requirement. Users may be basing their statements of requirements on assumptions about the logic of the program that do not match the programmer's assumptions. At a minimum, a set of performance requirements should document the response times and throughput rates that users consider acceptable, together with the workload assumptions on which those figures are based.
If the user says that response time is unimportant and that only the answer is of interest, you can ask whether a response time of ten times your current estimate of stand-alone execution time would be acceptable. If the answer is "yes," you can proceed to discuss throughput. Otherwise, you can continue the discussion of response time with the user's full attention.
Unless you are purchasing a software package that comes with detailed resource-requirement documentation, resource estimation can be the most difficult task in the performance-planning process, for several reasons.
A useful guideline is that the higher the level of abstraction, the more caution is needed to ensure that one doesn't receive a performance surprise. One must think very carefully about the data volumes and number of iterations implied by some apparently harmless constructs.
There are two approaches to dealing with resource-report ambiguity and variability. The first is to ignore the ambiguity and to keep eliminating sources of variability until the measurements become acceptably consistent. The second approach is to try to make the measurements as realistic as possible and describe the results statistically. We prefer the latter, since it yields results that have some correlation with production situations.
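As a small illustration of the statistical approach, the following sketch (the file name is hypothetical, and we assume one measured response time per line) computes the mean and standard deviation of a set of measurements:

$ awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean=%.2f sd=%.2f n=%d\n", m, sqrt(ss / n - m * m), n }' resp.times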
Our recommendation is to keep your estimates as close to reality as the specific situation allows.
In resource estimation, we are primarily interested in four dimensions (in no particular order):
CPU time        The processor cost of the workload
Disk accesses   The rate at which the workload generates disk reads or writes
Real memory     The amount of RAM the workload requires
LAN traffic     The number of packets the workload generates and the number of bytes of data exchanged
The following sections describe, or refer you to descriptions of, the techniques for determining these values in the various situations just described.
If the real program, a comparable program, or a prototype is available for measurement, the choice of technique depends on whether the system is dedicated to the measurement or running its normal production work, and on whether we need to measure the complete workload or only part of it.
Measuring a complete workload on a dedicated system is the ideal situation because it allows us to use measurements that include system overhead as well as the cost of individual processes.
To measure CPU and disk activity, we can use iostat. The command
$ iostat 5 >iostat.output
gives us a picture of the state of the system every 5 seconds during the measurement run. Remember that the first set of iostat output contains the cumulative data from the last boot to the start of the iostat command. The remaining sets are the results for the preceding interval, in this case 5 seconds. A typical set of iostat output on a large system looks like this:
tty:      tin       tout   cpu:  % user   % sys   % idle  % iowait
          1.2        1.6          60.2     10.8     23.4       5.6

Disks:     % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk1        0.0       0.0       0.0        0         0
hdisk2        0.0       0.0       0.0        0         0
hdisk3        0.0       0.0       0.0        0         0
hdisk4        0.0       0.0       0.0        0         0
hdisk11       0.0       0.0       0.0        0         0
hdisk5        0.0       0.0       0.0        0         0
hdisk6        0.0       0.0       0.0        0         0
hdisk7        3.0      11.2       0.8        8        48
hdisk8        1.8       4.8       1.2        0        24
hdisk9        0.0       0.0       0.0        0         0
hdisk0        2.0       4.8       1.2       24         0
hdisk10       0.0       0.0       0.0        0         0
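If we want an average over the whole measurement run, we can post-process the saved output. The following sketch is one possible approach, not part of iostat itself; it assumes the report format shown above, in which the CPU line is the only line consisting of six numeric fields, and it skips the first (cumulative) sample:

$ awk '$1 ~ /^[0-9.]+$/ && NF == 6 { if (seen++) { u += $3; s += $4; n++ } }
       END { if (n) printf "average %%user=%.1f %%sys=%.1f over %d intervals\n", u / n, s / n, n }' iostat.output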
To measure memory, we would use svmon. The command svmon -G gives a picture of overall memory use. The statistics are in terms of 4KB pages:
$ svmon -G
       m e m o r y            i n  u s e           p i n        p g  s p a c e
   size   inuse   free   pin   work  pers  clnt   work  pers  clnt   size  inuse
  24576   24366    210  2209  15659  6863  1844   2209     0     0  40960  26270
This machine's 96MB of memory (24576 4KB pages) is fully used: only 210 pages are free. About 64% of RAM (15659 of 24576 pages) is in use for working segments--the read/write memory of running programs. If there are long-running processes that we are interested in, we can review their memory requirements in detail. The following example determines the memory used by one of user xxxxxx's processes.
$ ps -fu xxxxxx
    USER    PID   PPID   C    STIME    TTY  TIME CMD
  xxxxxx  28031  51445  15 14:01:56  pts/9  0:00 ps -fu xxxxxx
  xxxxxx  51445  54772   1 07:57:47  pts/9  0:00 -ksh
  xxxxxx  54772   6864   0 07:57:47      -  0:02 rlogind

$ svmon -P 51445

  Pid                         Command        Inuse        Pin      Pgspace
51445                             ksh         1668          2         4077

Pid:  51445
Command:  ksh

 Segid  Type  Description          Inuse   Pin  Pgspace  Address Range
  8270  pers  /dev/fslv00:86079        1     0        0  0..0
  4809  work  shared library        1558     0     4039  0..4673 : 60123..65535
  9213  work  private                 37     2       38  0..31 : 65406..65535
   8a1  pers  code,/dev/hd2:14400     72     0        0  0..91
The working segment (9213), with 37 pages in use, is the cost of this instance of ksh. The 1558-page cost of the shared library and the 72-page cost of the ksh executable are spread across all of the running programs and all instances of ksh, respectively.
If we believe that our 96MB system is larger than necessary, we can use the rmss command to reduce the effective size of the machine and remeasure the workload. If paging increases significantly or response time deteriorates, we have reduced memory too far. This technique can be continued until we find a size that just runs our workload without degradation. See Assessing Memory Requirements via the rmss Command for more information on this technique.
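A typical rmss sequence might look like the following sketch (the sizes are illustrative only; consult that section for the flags and limits that apply to your system):

$ rmss -p        # display the current effective memory size
$ rmss -c 64     # simulate a 64MB machine; rerun and remeasure the workload
$ rmss -c 48     # try a smaller size if there was no degradation
$ rmss -r        # restore the machine's real memory size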
The primary command for measuring network usage is netstat. The following example shows the activity of a specific Token-Ring interface:
$ netstat -I tr0 5
     input    (tr0)     output            input   (Total)    output
  packets  errs  packets  errs colls   packets  errs  packets  errs colls
 35552822 213488 30283693    0     0  35608011 213488 30338882     0     0
      300      0      426     0     0      300      0      426     0     0
      272      2      190     0     0      272      2      190     0     0
      231      0      192     0     0      231      0      192     0     0
      143      0      113     0     0      143      0      113     0     0
      408      1      176     0     0      408      1      176     0     0
The first line of the report shows the cumulative network traffic since the last boot. Each subsequent line shows the activity for the preceding 5-second interval; in the first interval above, for example, tr0 received 300 packets (60 per second) and transmitted 426 (about 85 per second).
The techniques of measurement on production systems are similar to those on dedicated systems, but we must take pains to avoid degrading system performance. For example, the svmon -G command is very expensive to run. The one shown earlier took about 5 seconds of CPU time on a Model 950. Estimates of the resource costs of the most frequently used performance tools are shown in Appendix E, Performance of the Performance Tools.
Probably the most cost-effective tool is vmstat, which supplies data on memory, I/O, and CPU usage in a single report. If the vmstat intervals are kept reasonably long, say 10 seconds, the average cost is low--about 0.01 CPU seconds per report on a Model 950. See Identifying the Performance-Limiting Resource for more information on the use of vmstat.
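For example, the following command (the interval and count are illustrative) writes six reports at 10-second intervals to a file for later analysis. As with iostat, the first report shows cumulative statistics since boot:

$ vmstat 10 6 >vmstat.output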
By partial workload we mean measuring a part of the production system's workload for possible transfer to, or duplication on, a different system. Because this is a production system, we must be as unobtrusive as possible. At the same time, we must analyze the workload in enough detail to distinguish between the parts we are interested in and those we are not. To do a partial measurement, we need to discover what the workload elements of interest have in common. Are they invocations of the same program or set of related programs? The work of one or more specific users? Work submitted from one or more specific terminals?
Depending on the commonality, we could use one of the following:
ps -ef | grep pgmname
ps -fuusername, . . .
ps -ftttyname, . . .
to identify the processes of interest and report the cumulative CPU time consumption of those processes. We can then use svmon (judiciously!) to assess the memory use of the processes.
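As one possible sketch for the CPU side (assuming, as in the ps output shown earlier, that the TIME column is the seventh field and prints as minutes:seconds), the following pipeline totals the cumulative CPU time of user xxxxxx's processes:

$ ps -fuxxxxxx | awk 'NR > 1 { split($7, t, ":"); cpu += t[1] * 60 + t[2] }
                      END { printf "total CPU: %d seconds\n", cpu }'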
There are many tools for measuring the resource consumption of individual programs. Some of these programs are capable of more comprehensive workload measurements as well, but are too intrusive for use on production systems. Most of these tools are discussed in depth in the chapters that discuss tuning for minimum consumption of specific resources. Some of the more prominent are:
time       Measures the elapsed execution time and CPU consumption of an individual program. Discussed in Using the time Command to Measure CPU Use.
tprof      Measures the relative CPU consumption of programs, subroutine libraries, and the AIX kernel. Discussed in Using tprof to Analyze Programs for CPU Use.
svmon      Measures the real memory used by a process. Discussed in How Much Memory is Really Being Used?.
vmstat -s  Can be used to measure the I/O load generated by a program. Discussed in Measuring Overall Disk I/O with vmstat.
It is impossible to make precise estimates of unwritten programs. The invention and redesign that take place during the coding phase defy prediction, but the following rules of thumb may help you to get a general sense of the requirements. As a starting point, a minimal program would need:
Add to that basic cost allowances for demands implied by the design (the CPU times given are for a Model 580):
The best method for estimating peak and typical resource requirements is to use a queuing model such as BEST/1. Static models can be used, but you run the risk of overestimating or underestimating the peak resource requirements. In either case, you need to understand how multiple programs in a workload interact from the standpoint of resource requirements.
If you are building a static model, use a time interval that is the specified worst-acceptable response time for the most frequent or demanding program (usually they are the same). Determine, based on your projected number of users, their think time, their key entry rate, and the anticipated mix of operations, which programs will typically be running during each interval.
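As a hypothetical example of the arithmetic: if 100 users each submit a transaction every 30 seconds on average (think time plus keying time), and each transaction costs about 0.5 CPU-seconds, the steady-state CPU demand can be computed with bc:

$ echo "scale=2; (100 / 30) * 0.5" | bc
1.66

That is, the workload would demand about 1.66 CPU-seconds per second of real time, more than one processor can supply, even before allowing for peaks or for the rest of the system's work.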
Remember that these guidelines for a "back of an envelope" estimate are intended for use only when no extensive measurement is possible. Any application-specific measurement that can be used in place of a guideline will improve the accuracy of the estimate considerably.