The optimal settings of the tunable communications parameters vary with the type of LAN as well as with the communications-I/O characteristics of the predominant system and application programs. The following sections describe the global principles of communications tuning, followed by specific recommendations for the different types of LAN.
You can choose to tune primarily either for maximum throughput or for minimum memory use. Some recommendations apply to one or the other; some apply to both.
The former will be placed in one to four mbufs; the latter will make efficient use of the 4096-byte clusters that are used for writes larger than 935 bytes. Writing 936 bytes would result in 3160 bytes of wasted space per write. The application could hit the udp_recvspace default value of 65536 with just 16 writes totaling 14976 bytes of data.
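The arithmetic in this paragraph can be checked with a short sketch. It is written in Python purely for illustration and assumes that each write larger than 935 bytes is charged one or more whole 4096-byte clusters; the function names are hypothetical and are not part of any AIX interface.

```python
CLUSTER_SIZE = 4096     # bytes per mbuf cluster, as stated in the text
UDP_RECVSPACE = 65536   # default udp_recvspace, as stated in the text


def wasted_per_write(write_size, cluster_size=CLUSTER_SIZE):
    """Bytes left unused when a write is stored in whole clusters."""
    clusters = -(-write_size // cluster_size)  # ceiling division
    return clusters * cluster_size - write_size


def writes_to_fill(write_size, limit=UDP_RECVSPACE, cluster_size=CLUSTER_SIZE):
    """Number of writes whose clusters add up to the buffer limit."""
    clusters = -(-write_size // cluster_size)
    return limit // (clusters * cluster_size)


size = 936
count = writes_to_fill(size)
print(wasted_per_write(size))   # 3160
print(count, count * size)      # 16 14976
```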
If the application were using TCP, this would waste time as well as memory. TCP tries to form outbound data into MTU-sized packets. If the MTU of the LAN were larger than 14976 bytes, TCP would put the sending thread to sleep when the tcp_sendspace limit was reached. It would take a timeout ACK from the receiver to force the data to be written.
Note: When the no command is used to change parameters, the change is in effect only until the next system boot. At that point all parameters are initially reset to their defaults. To make the change permanent, you should put the appropriate no command in the /etc/rc.net file.
For transmit, the device drivers may provide a "transmit queue" limit. There may be both hardware queue and software queue limits, depending on the driver and adapter. Some drivers have only a hardware queue; some have both hardware and software queues. Some drivers control the hardware queue internally and allow only the software queue limits to be modified. Generally, the device driver queues a transmit packet directly to the adapter hardware queue. If the system CPU is fast relative to the speed of the network, or on an SMP system, the system may produce transmit packets faster than they can be transmitted on the network. This causes the hardware queue to fill. Once the hardware queue is full, drivers that provide a software queue place further packets on it. If the software transmit queue limit is reached, the transmit packets are discarded. This can affect performance because the upper-level protocols must then retransmit the packets.
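The queueing behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions (fixed production and transmission rates per time step, one hardware and one software queue); it is not how any particular AIX driver is implemented, and all names are hypothetical.

```python
def simulate_transmit(hw_limit, sw_limit, produced_per_tick, sent_per_tick, ticks):
    """Toy model: packets go to the hardware queue first, then to the
    software queue; packets that fit in neither queue are discarded."""
    hw = sw = dropped = 0
    sw_high_water = 0
    for _ in range(ticks):
        # The adapter transmits packets from the hardware queue.
        hw -= min(hw, sent_per_tick)
        # The driver refills the hardware queue from the software queue.
        moved = min(sw, hw_limit - hw)
        hw += moved
        sw -= moved
        # The protocol stack hands new packets to the driver.
        for _ in range(produced_per_tick):
            if sw == 0 and hw < hw_limit:
                hw += 1
            elif sw < sw_limit:
                sw += 1
            else:
                dropped += 1
        sw_high_water = max(sw_high_water, sw)
    return {"dropped": dropped, "sw_high_water": sw_high_water}


# A producer that is faster than the network eventually overflows a small
# software queue; a larger limit absorbs the same burst.
print(simulate_transmit(4, 8, 3, 2, 20))
print(simulate_transmit(4, 64, 3, 2, 20))
```

In this model a larger software queue limit only delays overflow when the producer is persistently faster than the network, which is why the limits matter mainly for bursts.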
Prior to AIX release 4.2.1, the upper limits on the transmit queues were in the range of 150 to 250, depending on the specific adapter. The system default values were quite low, typically 30. With AIX release 4.2.1 and later, the transmit queue limits were increased on most of the device drivers to 2048 buffers. The default values were also increased to 512 for most of these drivers, because faster CPUs and SMP systems can overrun the smaller queue limits.
For adapters that provide hardware queue limits, raising these values consumes more real memory because of the control blocks and buffers associated with them. Therefore, these limits should be raised only if needed, or on larger systems where the increase in memory use is negligible. Increasing the software transmit queue limits does not increase memory usage. It only allows packets to be queued that were already allocated by the higher-layer protocols.
Some adapters allow you to configure the number of resources used for receiving packets from the network. This might include the number of receive buffers (and even their size) or may simply be a receive queue parameter (which indirectly controls the number of receive buffers).
The receive resources may need to be increased to handle peak bursts on the network.
In AIX 3.2.x versions, the drivers allowed special applications to read received packets directly from the device driver. The device driver maintained a "receive queue" where these packets were queued. To limit the number of packets that would be queued for these applications, a receive queue size parameter was provided. In AIX Version 4, this interface is not supported, except for old Microchannel adapters when the bos.compat LPP is installed. See the parameter table for specific adapters.
For all the newer Microchannel adapters and the PCI adapters, receive queue parameters typically control the number of receive buffers that are provided to the adapter for receiving input packets.
AIX release 4.1.4 and later support Device Specific Mbufs. This allows a driver to allocate its own private set of buffers and have them pre-set up for DMA. This can provide additional performance because the overhead to set up the DMA mapping is incurred only once. Also, the adapter can allocate buffer sizes that are best suited to its MTU size. For example, ATM, HIPPI, and the SP2 switch support a 64KB MTU (packet) size. The maximum system mbuf size is 16KB. By allowing the adapter to have 64KB buffers, large 64KB writes from applications can be copied directly into the 64KB buffers owned by the adapter, instead of being copied into multiple 16KB buffers (which has more overhead to allocate and free the extra buffers).
The adapters that support Device Specific Mbufs are MCA ATM, MCA HIPPI, and the various SP2 high-speed switch adapters.
Device-specific buffers add an extra layer of complexity for the system administrator, however. The system administrator must use device-specific commands to view the statistics relating to the adapter's buffers and then change the adapter's parameters as necessary. If the statistics indicate that packets were discarded because of insufficient buffer resources, then those buffer sizes need to be increased.
Because of differences between drivers and the utilities used to alter these parameters, they are not fully described here. The MCA ATM parameters are listed in the table below. The statistics for ATM can be viewed using the atmstat -d atm0 command (substitute your ATM interface number as needed).
There are several status utilities that can be used. For AIX 4.1.0 and later, the adapter statistics show the transmit queue high-water limits and the number of queue overflows. You can use netstat -v, or go directly to the adapter statistics utilities (entstat for Ethernet, tokstat for Token Ring, fddistat for FDDI, atmstat for ATM, and so on).
For example, entstat -d en0 output is below. This shows the statistics from en0. The -d option lists any extended statistics for this adapter and should be used to ensure all statistics are displayed. The "Max Packets on S/W Transmit Queue:" field shows the high-water mark on the transmit queue. The "S/W Transmit Queue Overflow:" field shows the number of software queue overflows.
Note: These values may represent the "hardware queue" if the adapter does not support a software transmit queue. If there are Transmit Queue Overflows, then the hardware or software queue limits for the driver should be increased.
A shortage of receive resources is indicated by a nonzero "Packets Dropped:" count and, depending on the adapter type, by "Out of Rcv Buffers", "No Resource Errors:", or a similar counter.
ETHERNET STATISTICS (en1) :
Device Type: IBM 10/100 Mbps Ethernet PCI Adapter (23100020)
Hardware Address: 00:20:35:b5:02:4f
Elapsed Time: 0 days 0 hours 7 minutes 22 seconds

Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 1869143                              Packets: 1299293
Bytes: 2309523868                             Bytes: 643101124
Interrupts: 0                                 Interrupts: 823040
Transmit Errors: 0                            Receive Errors: 0
Packets Dropped: 0                            Packets Dropped: 0
                                              Bad Packets: 0
Max Packets on S/W Transmit Queue: 41
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 1

Broadcast Packets: 1                          Broadcast Packets: 0
Multicast Packets: 0                          Multicast Packets: 0
No Carrier Sense: 0                           CRC Errors: 0
DMA Underrun: 0                               DMA Overrun: 0
Lost CTS Errors: 0                            Alignment Errors: 0
Max Collision Errors: 0                       No Resource Errors: 0
Late Collision Errors: 0                      Receive Collision Errors: 0
Deferred: 3778                                Packet Too Short Errors: 0
SQE Test: 0                                   Packet Too Long Errors: 0
Timeout Errors: 0                             Packets Discarded by Adapter: 0
Single Collision Count: 96588                 Receiver Start Count: 0
Multiple Collision Count: 56661
Current HW Transmit Queue Length: 1

General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Driver Flags: Up Broadcast Running
        Simplex 64BitSupport
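As a rough illustration of how the counters discussed above might be checked automatically, the following Python sketch scans saved entstat -d output for the overflow and resource-error fields. The field labels are taken from the sample output; the helper names and the wording of the findings are hypothetical, and other adapters may report different counter names.

```python
import re

# A few lines copied from the sample entstat output in this section.
SAMPLE = """\
Max Packets on S/W Transmit Queue: 41
S/W Transmit Queue Overflow: 0
Max Collision Errors: 0                       No Resource Errors: 0
"""


def counter(text, label):
    """Return the first integer that follows 'label:' in the text, or None."""
    match = re.search(re.escape(label) + r":\s*(\d+)", text)
    return int(match.group(1)) if match else None


def queue_findings(text):
    """List the queue-related problems suggested by the counters."""
    findings = []
    if (counter(text, "S/W Transmit Queue Overflow") or 0) > 0:
        findings.append("transmit queue overflowed; consider a larger transmit queue")
    if (counter(text, "No Resource Errors") or 0) > 0:
        findings.append("receive resources ran out; consider more receive buffers")
    return findings


print(counter(SAMPLE, "Max Packets on S/W Transmit Queue"))   # 41
print(queue_findings(SAMPLE))                                 # []
```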
Another method is to use the netstat -i utility. If it shows nonzero counts in the "Oerrs" column for an interface, this is typically the result of output queue overflows. This works for all versions of AIX.
You can use the list attributes command lsattr -E -l [adapter-name] or you can use SMIT to show the adapter configuration.
Different adapters have different names for these variables. For example, the transmit queue parameter may be named "sw_txq_size", "tx_que_size", or "xmt_que_size", to name a few. The receive queue size or receive buffer pool parameters may be named "rec_que_size", "rx_que_size", or "rv_buf4k_min", for example.
Below is the output of an lsattr -E -l atm0 command on an IBM PCI 155 Mbps ATM adapter. It shows sw_txq_size set to 250 and the rv_buf4k_min receive buffers set to 0x30.
==== lsattr -E ========
dma_mem        0x400000    N/A                                          False
regmem         0x1ff88000  Bus Memory address of Adapter Registers      False
virtmem        0x1ff90000  Bus Memory address of Adapter Virtual Memory False
busintr        3           Bus Interrupt Level                          False
intr_priority  3           Interrupt Priority                           False
use_alt_addr   no          Enable ALTERNATE ATM MAC address             True
alt_addr       0x0         ALTERNATE ATM MAC address (12 hex digits)    True
sw_txq_size    250         Software Transmit Queue size                 True
max_vc         1024        Maximum Number of VCs Needed                 True
min_vc         32          Minimum Guaranteed VCs Supported             True
rv_buf4k_min   0x30        Minimum 4K-byte pre-mapped receive buffers   True
interface_type 0           Sonet or SDH interface                       True
adapter_clock  1           Provide SONET Clock                          True
uni_vers       auto_detect N/A                                          True
Here is an example of the settings of a Microchannel 10/100 Ethernet adapter, displayed with lsattr -E -l ent0. It shows tx_que_size and rx_que_size both set to 256.
bus_intr_lvl  11              Bus interrupt level                False
intr_priority 3               Interrupt priority                 False
dma_bus_mem   0x7a0000        Address of bus memory used for DMA False
bus_io_addr   0x2000          Bus I/O address                    False
dma_lvl       7               DMA arbitration level              False
tx_que_size   256             TRANSMIT queue size                True
rx_que_size   256             RECEIVE queue size                 True
use_alt_addr  no              Enable ALTERNATE ETHERNET address  True
alt_addr      0x              ALTERNATE ETHERNET address         True
media_speed   100_Full_Duplex Media Speed                        True
ip_gap        96              Inter-Packet Gap                   True
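Output in this layout is easy to process in a script. The following Python sketch parses lsattr -E style lines into a dictionary; it assumes that the value column contains no spaces and that the final True/False column indicates whether the attribute can be changed by the user. The function name and the dictionary keys are hypothetical.

```python
def parse_lsattr(text):
    """Parse 'lsattr -E -l <adapter>' style lines into a dict keyed by attribute."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[-1] not in ("True", "False"):
            continue  # skip blank lines, banners, and anything unrecognized
        attrs[fields[0]] = {
            "value": fields[1],
            "description": " ".join(fields[2:-1]),
            "settable": fields[-1] == "True",
        }
    return attrs


# A few lines copied from the ent0 example in this section.
SAMPLE = """\
dma_lvl       7               DMA arbitration level              False
tx_que_size   256             TRANSMIT queue size                True
rx_que_size   256             RECEIVE queue size                 True
media_speed   100_Full_Duplex Media Speed                        True
"""

settings = parse_lsattr(SAMPLE)
print(settings["tx_que_size"]["value"])    # 256
print(settings["dma_lvl"]["settable"])     # False
```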
The easiest way is to detach the interface (ifconfig en0 detach, for example) and then use SMIT -> devices -> communications -> [adapter type] -> change/show... to show the adapter settings. After you show the settings, move the cursor to the field you would like to change and press F4 to see the minimum and maximum range for the field or the specific set of sizes supported. You can select one of these sizes and then press Enter to submit the request and update the ODM database. Bring the adapter back online (ifconfig en0 [hostname], for example).
The other method to change these parameters is to use the chdev command.
chdev -l [adapter-name] -a [attribute-name]=newvalue
For example, to change the above tx_que_size on ent0 down to 128, use the following sequence of commands. Note that this driver supports only four different sizes, so it is better to use SMIT to see these values.
ifconfig en0 detach
chdev -l ent0 -a tx_que_size=128
ifconfig en0 [hostname] up
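When the same change has to be prepared for several adapters, the sequence can be generated rather than typed. The Python sketch below only builds the command strings and runs nothing. The set of allowed sizes is an assumption supplied by the caller (for example, as shown by F4 in SMIT), and the host name myhost is a placeholder.

```python
def queue_change_commands(interface, adapter, attribute, value, allowed, hostname):
    """Return the detach / chdev / re-attach commands as strings (nothing is run)."""
    if value not in allowed:
        raise ValueError(f"{attribute}={value} is not one of {sorted(allowed)}")
    return [
        f"ifconfig {interface} detach",
        f"chdev -l {adapter} -a {attribute}={value}",
        f"ifconfig {interface} {hostname} up",
    ]


# Illustrative set of sizes; check the real list for your adapter in SMIT.
ALLOWED_SIZES = {16, 32, 64, 128, 256}

for command in queue_change_commands("en0", "ent0", "tx_que_size", 128,
                                     ALLOWED_SIZES, "myhost"):
    print(command)
```

Rejecting unsupported values before any command is issued avoids detaching the interface for a change that chdev would refuse.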
The following information is provided to document the various adapter tuning parameters. These parameters and values are for AIX 4.3.1 and are subject to change. They are provided to aid in understanding the various tuning parameters, or when a system is not available to view the parameters.
The parameter names, defaults, and range values were obtained from the ODM database. The comment field was obtained from the lsattr -E -l interface-name command.
The Notes field is provided as additional comments.
Feature Code: 2980 (codename can't say)
Ethernet High-Performance LAN Adapter (8ef5)

Parameter      Default   Range     Comment                   Notes
-------------  --------  --------  ------------------------  -----------
xmt_que_size   512       20-2048   TRANSMIT queue size       SW TX queue
rec_que_size   30        20-150    RECEIVE queue size        See Note 1
rec_pool_size  37        16-64     RECEIVE buffer pool size  On Adapter
Notes:
- Note 1: This is a software receive queue that is provided only for compatibility with AIX 3.2.x applications that use the network device driver interface to read packets directly from the driver. This queue limits how many input packets are queued for these applications to receive. This parameter is defined only if bos.compat is installed.
This queue is not used by the normal TCP/IP stack.
Feature Code: 2992 (codename Durandoak)
Ethernet High-Performance LAN Adapter (8f95)

Parameter      Default   Range     Comment              Notes
-------------  --------  --------  -------------------  ----------
xmt_que_size   512       20-2048   TRANSMIT queue size  SW queue

Feature Code: 2994 (codename SanRemo)
IBM 10/100 Mbps Ethernet TX MCA Adapter (8f62)

Parameter      Default   Range             Comment              Notes
-------------  --------  ----------------  -------------------  ----------
tx_que_size    64        16,32,64,128,256  TRANSMIT queue size  HW queue
rx_que_size    32        16,32,64,128,256  RECEIVE queue size   HW queue

Feature Code: 2970 (codename Monterey)
Token-Ring High-Performance Adapter (8fc8)

Parameter      Default   Range     Comment              Notes
-------------  --------  --------  -------------------  ----------
xmt_que_size   99        32-2048   TRANSMIT queue size  SW queue
rec_que_size   30        20-150    RECEIVE queue size   See Note 1

Feature Code: 2972 (codename Wildwood)
Token-Ring High-Performance Adapter (8fa2)

Parameter      Default   Range     Comment                      Notes
-------------  --------  --------  ---------------------------  ----------
xmt_que_size   512       32-2048   TRANSMIT queue size          SW queue
rx_que_size    32        32-160    HARDWARE RECEIVE queue size  HW queue

Feature Code: 2727 (codename Scarborgh)
FDDI Primary Card, Single Ring Fiber

Parameter      Default   Range     Comment                         Notes
-------------  --------  --------  ------------------------------  ----------
tx_que_size    512       3-2048    Transmit Queue Size (in mbufs)
rcv_que_size   30        20-150    Receive Queue                   See Note 1

Feature Code: 2984 (codename Bumelia)
100 Mbps ATM Fiber Adapter (8f7f)

Parameter       Default    Range                         Comment                                 Notes
--------------  ---------  ----------------------------  --------------------------------------  ----------------------------
sw_queue        512        0-2048                        Software transmit queue len.            SW Queue
dma_bus_width   0x1000000  0x800000-0x40000000,0x100000  Amount of memory to map for DMA         See Note 3
max_sml_bufs    50         40-400                        Maximum Small ATM mbufs                 Max 256 byte buffers
max_med_bufs    100        40-1000                       Maximum Medium ATM mbufs                Max 4KB buffers
max_lrg_bufs    300        75-1000                       Maximum Large ATM mbufs                 Max 8KB buffers, See Note 2
max_hug_bufs    50         0-400                         Maximum Huge ATM mbufs                  Max 16KB buffers
max_spec_bufs   4          0-400                         Maximum ATM MTB mbufs                   Max of max_spec_buf size
spec_buf_size   64         32-1024                       Max Transmit Block (MTB) size (kbytes)
sml_highwater   20         10-200                        Minimum Small ATM mbufs                 Min 256 byte buffers
med_highwater   30         20-300                        Minimum Medium ATM mbufs                Min 4KB buffers
lrg_highwater   70         65-400                        Minimum Large ATM mbufs                 Min 8KB buffers
hug_highwater   10         4-300                         Minimum Huge ATM mbufs                  Min 16KB buffers
spec_highwater  20         0-300                         Minimum ATM MTB mbufs                   Min 64KB buffers
best_peak_rate  1500       1-155000                      Virtual Circuit Peak Segmentation Rate

Notes:
- Note 2: MCA ATM: the receive side also uses the large (8K) buffers. The receive logic uses only the 8K buffers, so if this size runs low, receive performance suffers.
The other buffer sizes are used only for transmit buffers.
- Note 3: MCA ATM: if you need to increase the total number of buffers, you may need to change the dma_bus_width parameter (default 0x1000000). The DMA bus memory width controls the total amount of memory used for ATM buffers. Increase this parameter if you get an error while increasing the maximum buffers or high-water limits.
Feature Code: 2989 (codename Clawhammer)
155 Mbps ATM Fiber Adapter (8f67)

Parameter      Default   Range     Comment     Notes
-------------  --------  --------  ----------  -------
(same as ATM 100 adapter above)

Feature Code: 2985 (codename Klickitat)
IBM PCI Ethernet Adapter (22100020)

Parameter      Default   Range             Comment              Notes
-------------  --------  ----------------  -------------------  ---------
tx_que_size    64        16,32,64,128,256  TRANSMIT queue size  HW Queues
rx_que_size    32        16,32,64,128,256  RECEIVE queue size   HW Queues

Feature Code: 2968 (codename Phoenix)
IBM 10/100 Mbps Ethernet PCI Adapter (23100020)

Parameter         Default  Range             Comment               Notes
----------------  -------  ----------------  --------------------  --------------------
tx_que_size       256      16,32,64,128,256  TRANSMIT queue size   HW Queue Note 1
rx_que_size       256      16,32,64,128,256  RECEIVE queue size    HW Queue Note 2
rxbuf_pool_size   384      16-2048           # buffers in receive  Dedicated receive
                                             buffer pool           buffers Note 3
Notes on Phoenix:
- Note 1: Prior to AIX 4.3.2, the default tx_que_size was 64.
- Note 2: Prior to AIX 4.3.2, the default rx_que_size was 32.
- Note 3: In AIX 4.3.2, the driver added a new parameter to control the number of buffers dedicated to receiving packets.
Feature Code: 2969 (codename Galaxy)
Gigabit Ethernet-SX PCI Adapter (14100401)

Parameter     Default  Range     Comment                               Notes
------------  -------  --------  ------------------------------------  --------
tx_que_size   512      512-2048  Software Transmit Queue size          SW Queue
rx_que_size   512      512       Receive queue size                    HW Queue
receive_proc  6        0-128     The number of receive descriptors to
                                 process before replenishing these to
                                 the hardware adapter

Feature Code: 2986 (codename Candlestick)
3Com 3C905-TX-IBM Fast EtherLink XL NIC

Parameter       Default  Range  Comment                       Notes
--------------  -------  -----  ----------------------------  ---------
tx_wait_q_size  32       4-128  Driver TX Waiting Queue Size  HW Queues
rx_wait_q_size  32       4-128  Driver RX Waiting Queue Size  HW Queues

Feature Code: 2742 (codename Honeycomb)
SysKonnect PCI FDDI Adapter (48110040)

Parameter      Default  Range  Comment              Notes
-------------  -------  -----  -------------------  ---------------
tx_queue_size  30       3-250  Transmit Queue Size  SW Queue
RX_buffer_cnt  42       1-128  Receive frame count  Rcv buffer pool

Feature Code: 2979 (codename Skyline)
IBM PCI Tokenring Adapter (14101800)

Parameter     Default  Range    Comment                      Notes
------------  -------  -------  ---------------------------  --------
xmt_que_size  96       32-2048  TRANSMIT queue size          SW Queue
rx_que_size   32       32-160   HARDWARE RECEIVE queue size  HW queue

Feature Code: 2979 (codename Cricketstick)
IBM PCI Tokenring Adapter (14103e00)

Parameter     Default  Range    Comment              Notes
------------  -------  -------  -------------------  --------
xmt_que_size  512      32-2048  TRANSMIT queue size  SW Queue
rx_que_size   64       32-512   RECEIVE queue size   HW Queue

Feature Code: 2988 (codename Lumbee)
IBM PCI 155 Mbps ATM Adapter (14107c00)

Parameter     Default    Range         Comment                                     Notes
------------  ---------  ------------  ------------------------------------------  --------
sw_txq_size   100        0-4096        Software Transmit Queue size                SW Queue
rv_buf4k_min  48 (0x30)  0-512 (x200)  Minimum 4K-byte pre-mapped receive buffers
The TCP protocol includes a mechanism for both ends of a connection to negotiate the maximum segment size (MSS) to be used over the connection. Each end uses the OPTIONS field in the TCP header to advertise a proposed MSS. The MSS that is chosen is the smaller of the values provided by the two ends.
The purpose of this negotiation is to avoid the delays and throughput reductions caused by fragmentation of the packets when they pass through routers or gateways and reassembly at the destination host.
The value of MSS advertised by the TCP software during connection setup depends on whether the other end is a local system on the same physical network (that is, the systems have the same network number) or whether it is on a different, remote, network.
If the other end is local, the MSS advertised by TCP is based on the MTU (maximum transmission unit) of the local network interface:
TCP MSS = MTU - TCP header size - IP header size.
Since this is the largest possible MSS that can be accommodated without IP fragmentation, this value is inherently optimal, so no MSS tuning is required for local networks.
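The formula and the negotiation rule can be put together in a few lines. The Python sketch below assumes 20-byte TCP and IP headers without options (the usual combined length of 40 bytes); the function names are hypothetical.

```python
TCP_HEADER = 20  # bytes, assuming no TCP options
IP_HEADER = 20   # bytes, assuming no IP options


def advertised_mss(mtu):
    """MSS a host advertises for a local peer: MTU minus both headers."""
    return mtu - TCP_HEADER - IP_HEADER


def negotiated_mss(local_mtu, peer_mtu):
    """The connection uses the smaller of the two advertised values."""
    return min(advertised_mss(local_mtu), advertised_mss(peer_mtu))


print(advertised_mss(1500))        # 1460
print(negotiated_mss(4352, 1500))  # 1460
```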
When the other end is on a remote network, TCP in AIX defaults to advertising an MSS of 512 bytes. This conservative value is based on a requirement that all IP routers support an MTU of at least 576 bytes.
The optimal MSS for remote networks is based on the smallest MTU of the intervening networks in the route between source and destination. In general, this is a dynamic quantity and could only be ascertained by some form of path MTU discovery. The TCP protocol does not provide any mechanism for doing this, which is why a conservative value is the default. However, it is possible to enable the TCP PMTU discovery by using the command:
no -o tcp_pmtu_discover=1
A normal side effect of this setting is that the routing table grows (one more entry for each active TCP connection). The no option route_expire should be set to a nonzero value so that any unused cached route entry is removed from the table after route_expire time of inactivity.
While the conservative default is appropriate in the general Internet, it can be unnecessarily restrictive for private internets within an administrative domain. In such an environment, MTU sizes of the component physical networks are known and the minimum MTU and optimal MSS can be determined by the administrator. AIX provides several ways in which TCP can be persuaded to use this optimal MSS. Both source and destination hosts must support these features. In a heterogeneous, multi-vendor environment, the availability of the feature on both systems may determine the choice of solution.
The default MSS of 512 can be overridden by specifying a static route to a specific remote network and using the -mtu option of the route command to specify the MTU to that network. In this case, you would specify the actual minimum MTU of the route, rather than calculating an MSS value.
In a small, stable environment, this method allows precise control of MSS on a network-by-network basis. The disadvantages of this approach are:
The default value of 512 that TCP uses for remote networks can be changed via the no command by changing the tcp_mssdflt parameter. This change is a systemwide change.
The value specified to override the MSS default should be the minimum MTU value less 40 to allow for the normal length of the TCP and IP headers.
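As a worked example of this rule, the following Python sketch takes the MTUs of the networks along a route, subtracts 40 from the smallest, and formats the corresponding no command. The list of MTUs is illustrative only and the function names are hypothetical; the sketch prints the command rather than running it.

```python
def tcp_mssdflt_for(path_mtus, header_bytes=40):
    """Smallest MTU on the path, less the normal TCP and IP header length."""
    if not path_mtus:
        raise ValueError("at least one MTU is required")
    return min(path_mtus) - header_bytes


def no_command(path_mtus):
    """Format the systemwide setting as a 'no' command string."""
    return f"no -o tcp_mssdflt={tcp_mssdflt_for(path_mtus)}"


# Illustrative path: an FDDI ring, an Ethernet, and a Token Ring.
print(no_command([4352, 1500, 1492]))   # no -o tcp_mssdflt=1452
```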
In an environment with a larger-than-default MTU, this method has the advantage that the MSS does not need to be set on a per-network basis. The disadvantages are:
Several physical networks can be made to share the same network number by subnetting (see "TCP/IP Addressing"). The no option subnetsarelocal specifies, on a system-wide basis, whether subnets are to be considered local or remote networks. With subnetsarelocal=1 (the default), Host A on subnet 1 considers Host B on subnet 2 to be on the same physical network.
The consequence of this is that when Host A and Host B establish a connection, they negotiate the MSS assuming they are on the same network. Each host advertises an MSS based on the MTU of its network interface. This usually leads to an optimal MSS being chosen.
This approach has several advantages:
The disadvantages are:
In this scenario, Hosts A and B would establish a connection based on a common MTU of 4352. A packet going from A to B would be fragmented by Router 1 and defragmented by Router 2, and the reverse would occur going from B to A.
At the IP layer, the only tunable parameter is ipqmaxlen, which controls the length of the IP input queue discussed earlier (which exists only in AIX Version 3). Packets may arrive very quickly and overrun the IP input queue. In the AIX operating system, there is no simple way to determine whether this is happening. However, an overflow counter can be viewed using the crash command. To check this value, start the crash command and, when the prompt appears, type knlist ipintrq. This command returns a hexadecimal value, which may vary from system to system. Next, add 10 (hex) to this value, and then use the result as an argument for the od subcommand. For example:
# crash
> knlist ipintrq
     ipintrq: 0x0149ba68
> od 0149ba78 1
0149ba78: 00000000     <-- This is the value of the IP input queue overflow counter
> quit
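The hexadecimal addition in this procedure is easy to get wrong by hand. The Python sketch below computes the od argument from a knlist output line of the form shown above; the function name is hypothetical, and the 0x10 offset is the one given in the text.

```python
def overflow_counter_address(knlist_line, offset=0x10):
    """Turn a line such as 'ipintrq: 0x0149ba68' into the od argument."""
    base = int(knlist_line.split(":")[1].strip(), 16)
    return format(base + offset, "08x")


print(overflow_counter_address("ipintrq: 0x0149ba68"))   # 0149ba78
```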
If the number returned is greater than 0, overflows have occurred. The maximum length of this queue is set using the no command. For example:
no -o ipqmaxlen=100
allows 100 packets to be queued up. The exact value to use is determined by the maximum burst rate received. If this cannot be determined, using the number of overflows can help determine what the increase should be. No additional memory is used by increasing the queue length. However, an increase may result in more time spent in the off-level interrupt handler, since IP will have more packets to process on its input queue. This could adversely affect processes needing CPU time. The tradeoff is reduced packet dropping versus CPU availability for other processing. It is best to increase ipqmaxlen by moderate increments if the tradeoff is a concern.
Ethernet is one of the contributors to the "least common denominator" algorithm of MTU choice. If a configuration includes Ethernets and other LANs, and there is extensive traffic among them, the MTUs of all of the LANs may need to be set to 1500 bytes to avoid fragmentation when data enters an Ethernet.
The default MTU of 1492 bytes is appropriate for Token Rings that interconnect to Ethernets or to heterogeneous networks in which the minimum MTU is not known.
Despite the comparatively low MTU, this high-speed medium benefits from substantial increases in socket buffer size.