When a performance problem is reported, the kind of performance problem will often help the performance analyst to narrow the list of possible culprits.
If everything that uses a particular device or service slows down at times, refer to the topic that covers that device or service (for example, Monitoring and Tuning Disk I/O or Monitoring and Tuning Communications I/O).
If, instead, a single program runs slowly, that may seem to be the trivial case, but there are still questions to be asked:
If the program has just started running slowly, a recent change may be the cause; if the program itself has been changed or a new version installed, check with the programmer or vendor.
If a file used by the program (including its own executable) has been moved, the program may now be experiencing LAN delays that did not exist before, or files that were previously on different disks may now be contending for a single disk accessor.
If the system administrator has changed system-tuning parameters, the program may be subject to constraints that it did not experience before. For example, if the schedtune -r command has been used to change the way priority is calculated, programs that used to run rather quickly in the background may now be slowed down, while foreground programs have sped up.
While they allow programs to be written quickly, interpretive languages have the problem that they are not optimized by a compiler. Also, it is easy in a language like awk to request an extremely compute- or I/O-intensive operation with a few characters. It is often worthwhile to perform a desk check or informal peer review of such programs with the emphasis on the number of iterations implied by each operation.
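As an illustration (the file name bigfile and the script itself are hypothetical, not drawn from any real workload), the following awk one-liner looks harmless but compares every pair of input lines, so its cost grows with the square of the file size:

$ awk '{ line[NR] = $0 }
       END { for (i = 1; i <= NR; i++)
               for (j = i + 1; j <= NR; j++)
                 if (line[i] == line[j]) dups++
             print dups+0 " duplicate pairs" }' bigfile

A desk check that asks how many times each loop body will run catches this kind of cost before the script reaches production data.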
The AIX file system uses some of the system's memory to hold pages of files for future reference. If a disk-limited program is run twice in quick succession, it will normally run faster the second time than the first. Similar phenomena may be observed with programs that use NFS and DFS. This can also occur with large programs, such as compilers. The program's algorithm may not be disk-limited, but the time needed to load a large executable may make the first execution of the program much longer than subsequent ones.
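The effect is easy to observe with the time command. In this sketch the file name is only a placeholder, and the difference will be most visible with a file large enough to dominate the run time:

$ time wc -l /some/large/file      # first run: pages must be read from disk
$ time wc -l /some/large/file      # second run: pages are usually still in memory, so real time drops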
Identifying the Performance-Limiting Resource describes techniques for finding the bottleneck.
Most people have experienced the rush-hour slowdown that occurs because a large number of people in the organization habitually use the system at one or more particular times each day. This phenomenon is not always simply due to a concentration of load. Sometimes it is an indication of an imbalance that is (at present) only a problem when the load is high. There are also other sources of periodicity in the system that should be considered.
If the disks are unbalanced, look at Monitoring and Tuning Disk I/O.
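A quick way to look for imbalance, assuming the standard AIX iostat disk report, is to take a few interval samples and compare per-disk activity; one disk with a much higher % tm_act than its peers suggests that data placement needs attention:

$ iostat -d 5 3       # three 5-second samples of per-disk activity (% tm_act, Kbps, tps)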
If the CPU is saturated, use ps to identify the programs being run during this period. The script given in Performance Monitoring Using iostat, netstat, vmstat simplifies the search for the CPU hogs.
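If PTX is not available, a rough substitute for that script (a sketch, not the script itself) is to sort a BSD-style ps listing by the %CPU column:

$ ps aux | sort -rn -k3,3 | head -10      # the ten processes with the highest %CPU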
If the slowdown is counter-intuitive, such as paralysis during lunch time, look for a pathological program, such as a graphics-intensive Xlock or a game program. Some versions of Xlock are known to use huge amounts of CPU time to display graphic patterns on an idle display. It is also possible that someone is running a program that is a known CPU burner and is trying to run it at the least intrusive time.
If you find that the problem stems from conflict between foreground activity and long-running, CPU-intensive programs that are, or should be, run in the background, you should consider using schedtune -r -d to give the foreground higher priority. See Tuning the Process-Priority-Value Calculation with schedtune.
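On most AIX releases the command itself is shipped among the samples; the values below are placeholders that only show the form of the command, so consult that topic before choosing real ones:

/usr/samples/kernel/schedtune                 # with no flags, displays the current settings
/usr/samples/kernel/schedtune -r 32 -d 16     # illustrative values: weight recent CPU usage more
                                              # heavily, penalizing long-running CPU-intensive
                                              # processes relative to interactive work
/usr/samples/kernel/schedtune -D              # restores the default values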
The best tool for this situation is an overload detector, such as xmperf's filtd program (a component of PTX). filtd can be set up to execute shell scripts or collect specific information when a particular condition is detected. You can construct a similar, but more specialized, mechanism using shell scripts containing vmstat, netstat, and ps.
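The following ksh loop is a sketch of such a mechanism. The 20 percent idle threshold, the log file name, and the vmstat column position are assumptions to adjust for your own system:

#!/bin/ksh
# Crude overload detector: sample vmstat, and when idle CPU is low,
# record the busiest processes and the interface counters for later analysis.
LOG=/tmp/overload.log
while true
do
    # The last line of a two-report vmstat run covers the most recent interval;
    # on the standard layout the idle-CPU percentage is field 16.
    idle=`vmstat 10 2 | tail -1 | awk '{ print $16 }'`
    if [ "$idle" -lt 20 ]
    then
        echo "Overload at `date` (idle=${idle}%)" >> $LOG
        ps aux | sort -rn -k3,3 | head -10 >> $LOG
        netstat -i >> $LOG
    fi
done

Started with nohup and left running in the background, the script builds a record of what was active during each slowdown, which can then be examined at leisure.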
If the problem is local to a single system in a distributed environment, there is probably a pathological program at work, or perhaps two that intersect randomly.
Sometimes a system seems to "pick on" an individual user. Quantify the problem: ask the user which commands are used frequently, and time them under that user's ID, as in the following example:
$ time cp .profile testjunk
real    0m0.08s
user    0m0.00s
sys     0m0.01s
Then run them under a satisfactory userid. Is there a difference in the reported real time?
There are some common problems that arise in the transition from independent systems to distributed systems. They usually result from the need to get a new configuration running as soon as possible, or from a lack of awareness of the cost of certain functions. In addition to tuning the LAN configuration in terms of MTUs and mbufs (see the Monitoring and Tuning Communications I/O chapter), we should look for LAN-specific pathologies or nonoptimal situations that may have evolved through a sequence of individually reasonable decisions.
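For a first look at the current state, the following standard commands show the interface MTUs, the mbuf pool usage, and the network tuning options; treat them as a starting point rather than a diagnosis:

$ netstat -i          # per-interface MTU and packet counts
$ netstat -m          # mbuf pool usage, including failed allocation requests
$ no -a               # current values of the network tuning options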
When a software error causes a broadcast storm, even systems that are not actively using the network can be slowed by the incessant interrupts and by the CPU resource consumed in receiving and processing the packets. These bugs are better detected and localized with LAN analysis devices than with the normal AIX performance tools.
Using an AIX system as a router consumes large amounts of CPU time to process and copy packets. It is also subject to interference from other work being processed by the AIX system. Dedicated hardware routers and bridges are usually a more cost-effective and robust solution to the need to connect LANs.
At some stages in the development of distributed configurations, NFS mounts are used to give users on new systems access to their home directories on their original systems. This simplifies the initial transition, but imposes a continuing data communication cost. It is not unusual to find users on system A working primarily with data on system B, and vice versa.
Access to files via NFS imposes a considerable cost in LAN traffic, client and server CPU time, and end-user response time. As a general principle, the user and the data should normally reside on the same system. The exceptions are those situations in which an overriding concern justifies the extra expense and time of remote data, such as a need to centralize data for more reliable backup and control, or a need to ensure that all users are working with the most current version of a program.
If these and other needs dictate a significant level of NFS client-server interchange, it is better to dedicate a system to the role of server than to have a number of systems that are part-server, part-client.
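Whatever the division of roles, the NFS call counters give an idea of how much client-server interchange is actually taking place; both commands are standard, although what counts as "too much" depends on the workload:

$ nfsstat -c          # client-side RPC and NFS call counts (run on the client)
$ nfsstat -s          # server-side call counts (run on the server)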
The simplest method of porting a program into a distributed environment is to replace program calls with RPCs on a 1:1 basis. Unfortunately, the disparity in performance between local program calls and RPCs is even greater than the disparity between local disk I/O and NFS I/O. Assuming that the RPCs are really necessary, they should be batched whenever possible.
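The same round-trip arithmetic that penalizes one-call-per-record RPCs applies to any per-item remote operation. As a purely illustrative shell analogy (server1 and the directory names are placeholders), compare paying the network latency once per file with paying it once per batch:

# One remote operation per file: every rcp pays connection and latency costs
for f in reports/*.txt
do
    rcp "$f" server1:/work/reports/
done
# One batched operation: a single stream moves the same data
tar -cf - reports | rsh server1 "cd /work && tar -xf -"

The same reshaping applies to application RPCs: gather the requests on the client and ship them in one call whenever the interface allows it.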
Make sure you have followed the configuration recommendations in the appropriate subsystem manual and/or the recommendations in the appropriate "Monitoring and Tuning" chapter of this book.