As with other network services, problems can occur on machines that use the Network File System (NFS). Troubleshooting for these problems involves understanding the strategies for tracking NFS problems, recognizing NFS-related error messages, and selecting the appropriate solutions. When tracking down an NFS problem, isolate each of the three main points of failure to determine which is not working: the server, the client, or the network itself.
Note: See "Troubleshooting the Network Lock Manager" for file lock problems.
When the network or server has problems, programs that access hard-mounted remote files fail differently from those that access soft-mounted remote files.
If a server fails to respond to a hard-mount request, NFS prints the message:
NFS server hostname not responding, still trying
Hard-mounted remote file systems cause programs to hang until the server responds, because the client retries the mount request until it succeeds. When performing a hard mount, use the bg option of the mount command (-o bg) so that if the server does not respond, the client retries the mount in the background.
If a server fails to respond to a soft-mount request, NFS prints the message:
Connection timed out
Soft-mounted remote file systems return an error after the client has retried unsuccessfully for a while. Unfortunately, many programs do not check the return status of file system operations, so you may not see this error when accessing soft-mounted files. The NFS error message is, however, printed on the console.
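Because a soft mount reports failure through the return status of the file operation, a program has to check that status to notice the timeout. The following Python sketch shows one way to do so. The function name read_remote and the sample file are hypothetical, and the sketch assumes that the timeout surfaces as the ETIMEDOUT error number, whose standard message text is "Connection timed out".

```python
import errno
import os
import tempfile


def read_remote(path):
    """Read a file; return (data, None), or (None, message) on a timeout."""
    try:
        with open(path, "rb") as handle:
            return handle.read(), None
    except OSError as err:
        # Assumption: a soft-mount timeout is reported as ETIMEDOUT.
        if err.errno == errno.ETIMEDOUT:
            return None, "server timed out while reading " + path
        raise  # any other error is left for the caller to handle


# Usage with a local stand-in file (a real check would use an NFS path).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as handle:
        handle.write(b"hello")
    data, problem = read_remote(sample)
    print(problem if problem else data)
```

Returning the problem as a value, rather than ignoring it, keeps the timeout visible to the caller instead of only on the console.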
If a client is having NFS trouble, do the following:
To verify that the server is up, enter the following at the client command line:
/usr/bin/rpcinfo -p server_name
If the server is up, a list of programs, versions, protocols, and port numbers is printed, similar to the following:
program vers proto   port
 100000    2   tcp    111  portmapper
 100000    2   udp    111  portmapper
 100005    1   udp   1025  mountd
 100001    1   udp   1030  rstatd
 100001    2   udp   1030  rstatd
 100001    3   udp   1030  rstatd
 100002    1   udp   1036  rusersd
 100002    2   udp   1036  rusersd
 100008    1   udp   1040  walld
 100012    1   udp   1043  sprayd
 100005    1   tcp    694  mountd
 100003    2   udp   2049  nfs
 100024    1   udp    713  status
 100024    1   tcp    715  status
 100021    1   tcp    716  nlockmgr
 100021    1   udp    718  nlockmgr
 100021    3   tcp    721  nlockmgr
 100021    3   udp    723  nlockmgr
 100020    1   udp    726  llockmgr
 100020    1   tcp    728  llockmgr
 100021    2   tcp    731  nlockmgr
If a similar response is not returned, log in to the server at the server console and check the status of the inetd daemon by following the instructions in "Get the Current Status of the NFS Daemons".
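To verify that the mount, portmap, and nfs services are running on the server, enter the following commands at the client command line:
/usr/bin/rpcinfo -u server_name mount
/usr/bin/rpcinfo -u server_name portmap
/usr/bin/rpcinfo -u server_name nfs
If the daemons are running at the server, the following responses are returned: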
program 100005 version 1 ready and waiting
program 100000 version 2 ready and waiting
program 100003 version 2 ready and waiting
The program numbers correspond to the commands, respectively, as shown in the example output above. If a similar response is not returned, log in to the server at the server console and check the status of the daemons by following the instructions in "Get the Current Status of the NFS Daemons".
To check that the file system you want to mount is exported, enter the following:
showmount -e server_name
This command lists all the file systems currently exported by server_name.
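The rpcinfo -p check can also be scripted. The following Python sketch parses saved output in the format shown in the example listing and reports which required services are missing. The function names and the default list of required services (portmapper, mountd, nfs) are assumptions chosen for illustration; adjust them to the services your clients depend on.

```python
def registered_services(rpcinfo_output):
    """Return the set of service names found in `rpcinfo -p` output."""
    services = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows have five fields: program, version, protocol, port, name.
        if len(fields) == 5 and fields[0].isdigit():
            services.add(fields[4])
    return services


def missing_services(rpcinfo_output,
                     required=("portmapper", "mountd", "nfs")):
    """List the required services that the server has not registered."""
    found = registered_services(rpcinfo_output)
    return [name for name in required if name not in found]


# A shortened copy of the example listing, used as sample input.
SAMPLE = """\
program vers proto   port
 100000    2   tcp    111  portmapper
 100005    1   udp   1025  mountd
 100003    2   udp   2049  nfs
"""

print(missing_services(SAMPLE))
```

An empty list means that every required service is registered; any name in the list points to a daemon to check on the server.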
When an application program writes data to a file in an NFS-mounted file system, the write operation is scheduled for asynchronous processing by the biod daemon. If an error occurs at the NFS server when the data is actually written to disk, the error is returned to the NFS client and the biod daemon saves the error internally in NFS data structures. The stored error is returned to the application program the next time it calls either the fsync or close function. As a consequence, the application is not notified of the write error until it calls fsync or closes the file. A typical example is a full file system on the server, which causes writes attempted by a client to fail.
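Because a deferred write error is reported only by fsync or close, an application that needs to know whether its data reached the server should check the result of both calls. The following Python sketch illustrates the pattern; the function name write_and_confirm is hypothetical, and the example writes to a local temporary file as a stand-in for an NFS-mounted path.

```python
import os
import tempfile


def write_and_confirm(path, data):
    """Write data to path; return 0 on success or the errno of the failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # A write can succeed locally even if the server later rejects it.
            written += os.write(fd, data[written:])
        os.fsync(fd)  # a stored write error can be reported here
    except OSError as err:
        try:
            os.close(fd)
        except OSError:
            pass  # the first error is the one worth reporting
        return err.errno
    try:
        os.close(fd)  # a stored write error can also be reported here
    except OSError as err:
        return err.errno
    return 0


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "report.txt")
    status = write_and_confirm(target, b"quarterly totals\n")
    print("write status:", status)
```

A nonzero status (for example, the error number for a full file system) tells the caller that the data may not be on the server even though the individual write calls returned normally.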
The following sections explain error codes that can be generated while using NFS.
Insufficient transmit buffers on your network can cause the following error message:
nfs_server: bad sendreply
To increase transmit buffers, use the Web-based System Manager fast path, wsm devices, or the System Management Interface Tool (SMIT) fast path, smit commodev. Then select your adapter type, and increase the number of transmit buffers.
A remote mounting process can fail in several ways. The error messages associated with mounting failures are as follows:
If you issue the mount command with either a directory or file system name but not both, the command looks in the /etc/filesystems file for an entry whose file system or directory field matches the argument. If the mount command finds an entry such as the following:
/dancer.src:
        dev = /usr/src
        nodename = d61server
        type = nfs
        mount = false
then it performs the mount as if you had entered the following at the command line:
/usr/sbin/mount -n d61server -o rw,hard /usr/src /dancer.src
Check the spelling and the syntax in your mount command. If the command is correct, your network does not run NIS, and you only get this message for this host name, check the entry in the /etc/hosts file.
If your network is running NIS, make sure that the ypbind daemon is running by entering the following at the command line:
ps -ef
You should see the ypbind daemon in the list. Try using the rlogin command to log in remotely to another machine, or use the rcp command to remote-copy something to another machine. If this also fails, your ypbind daemon is probably stopped or hung.
If you only get this message for this host name, you should check the /etc/hosts entry on the NIS server.
If you cannot log in to the server remotely with the rlogin command but the server is up, you should check the network connection by trying to log in remotely to some other machine. You should also check the server's network connection.
You can get a list of the server's exported file systems by running the following command at the command line:
showmount -e hostname
If the file system you want is not in the list, or your machine name or netgroup name is not in the user list for the file system, log in to the server and check the /etc/exports file for the correct file system entry. A file system name that appears in the /etc/exports file, but not in the output from the showmount command, indicates a failure in the mountd daemon. Either the daemon could not parse that line in the file, it could not find the directory, or the directory name was not a locally mounted directory. If the /etc/exports file looks correct and your network runs NIS, check the server's ypbind daemon. It may be stopped or hung. For more information, see AIX Version 4.3 Network Information Services (NIS and NIS+) Guide.
Check the server's /etc/exports file, and, if applicable, the ypbind daemon. In this case you can just change your host name with the hostname command and retry the mount command.
If access to remote files seems unusually slow, ensure that access time is not being inhibited by a runaway daemon, a bad tty line, or a similar problem.
At the server, enter the following at the command line:
ps -ef
If the server seems fine and other users are getting timely responses, make sure your biod daemons are running. Try the following steps:
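1. Run the ps -ef command and look for the biod daemons in the output. If they are not running or are hung, continue with steps 2 and 3.
2. Stop the biod daemons by entering the following:
stopsrc -x biod -c
3. Start the biod daemons by entering the following:
startsrc -s biod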
To determine if the biod daemons are hung, run the ps command as above, copy a large file from a remote system, and then run the ps command again. If the biod daemons do not accumulate CPU time, they are probably hung.
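The before-and-after comparison can be sketched in Python as follows. The sketch assumes that you have already collected the TIME value for each biod process ID from the two ps listings; the function names and the sample process IDs are hypothetical, and TIME values that include a day count are not handled.

```python
def cpu_seconds(time_field):
    """Convert a ps TIME value such as '0:07' or '1:02:03' to seconds."""
    seconds = 0
    for part in time_field.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def stalled_pids(before, after):
    """Return the PIDs present in both snapshots whose CPU time did not grow."""
    return sorted(
        pid for pid in before
        if pid in after
        and cpu_seconds(after[pid]) <= cpu_seconds(before[pid])
    )


# Hypothetical snapshots: PID -> TIME column, taken before and after the copy.
before_copy = {4130: "0:07", 4388: "0:03"}
after_copy = {4130: "0:09", 4388: "0:03"}

print(stalled_pids(before_copy, after_copy))
```

A process that appears in the result did not accumulate CPU time during the copy. If every biod daemon appears, the daemons are probably hung.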
If the biod daemons are working, check the network connections. The nfsstat command determines whether you are dropping packets. Use the nfsstat -c and nfsstat -s commands to determine if the client or server is retransmitting large blocks. Retransmissions are always a possibility due to lost packets or busy servers. A retransmission rate of 5% is considered high.
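As a worked example of the 5% guideline, the following Python sketch computes a retransmission rate from two counters. It assumes that you read the total number of RPC calls and the number of retransmissions from the nfsstat output yourself; the function names and the sample counts are hypothetical.

```python
def retransmission_rate(calls, retrans):
    """Return retransmissions as a percentage of total RPC calls."""
    if calls <= 0:
        return 0.0  # no traffic yet, so there is nothing to report
    return 100.0 * retrans / calls


def is_high(calls, retrans, threshold_percent=5.0):
    """Apply the guideline that a rate of 5% is considered high."""
    return retransmission_rate(calls, retrans) >= threshold_percent


# Hypothetical counters: 12000 calls, 720 of them retransmitted.
rate = retransmission_rate(12000, 720)
print("%.1f%% retransmitted, high: %s" % (rate, is_high(12000, 720)))
```

With these sample counters the rate is 6%, which is above the guideline and suggests checking for lost packets or a busy server.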
The probability of retransmissions can be reduced by changing communication adapter transmit queue parameters. The System Management Interface Tool (SMIT) can be used to change these parameters.
The following values are recommended for NFS servers.
Communication Adapter Maximum Transmission Unit (MTU) and Transmit Queue Sizes
Adapter | MTU | Transmit Queue |
Token Ring (4Mb, 16Mb) | 4Mb: 1500, 3900; 16Mb: 1500, 8500 | 50; 40 (Increase if the nfsstat command times out.); 40 (Increase if the nfsstat command times out.) |
Ethernet | 1500 | 40 (Increase if the nfsstat command times out.) |
The larger MTU sizes for each token-ring speed reduce processor use and significantly improve read/write operations.
Notes:
To set MTU size, use the Web-based System Manager fast path, wsm network, or the SMIT fast path, smit chif. Select the appropriate adapter and enter an MTU value in the Maximum IP Packet Size field.
The ifconfig command can be used to set MTU size (and must be used to set MTU size at 8500). The format for the ifconfig command is:
ifconfig trn NodeName up mtu MTUSize
where trn is your adapter name, for example, tr0.
Another method of setting MTU sizes combines the ifconfig command with SMIT.
Communication adapter transmit queue sizes are set with SMIT. Enter the smit chgtok fast path, select the appropriate adapter, and enter a queue size in the Transmit field.
If programs hang during file-related work, the NFS server could have stopped. In this case, the following error message may be displayed:
NFS server hostname not responding, still trying
The NFS server (hostname) is down. This indicates a problem with the NFS server, the network connection, or the NIS server.
Check the servers from which you have mounted file systems if your machine hangs completely. If one or more of them is down, do not be concerned. When the server comes back up, your programs continue automatically. No files are destroyed.
If a soft-mounted server dies, other work is not affected. Programs that time out trying to access soft-mounted remote files fail with the errno message, but you will still be able to access your other file systems.
If all servers are running, determine whether others who are using the same servers are having trouble. More than one machine having service problems indicates a problem with the server's nfsd daemons. In this case, log in to the server and run the ps command to see if the nfsd daemon is running and accumulating CPU time. If not, you may be able to stop and then restart the nfsd daemon. If this does not work, you have to reboot the server.
Check your network connection and the connection of the server if other systems seem to be up and running.
Sometimes, after mounts have been successfully established, there are problems in reading, writing, or creating remote files or directories. Such difficulties are usually due to permissions or authentication problems. Permission and authentication problems can vary in cause depending on whether NIS is being used and secure mounts are specified.
The simplest case occurs when nonsecure mounts are specified and NIS is not used. In this case, user IDs (UIDs) and group IDs (GIDs) are mapped solely through the server's and client's /etc/passwd and /etc/group files, respectively. In this scheme, for a user named john to be identified as john on both the client and the server, john must have the same UID number in the /etc/passwd file on both machines. The following is an example of how mismatched UIDs might cause problems:
User john is UID 200 on client foo.
User john is UID 250 on server bar.
User jane is UID 200 on server bar.
The /home/bar directory is mounted from server bar onto client foo. If user john is editing files on the /home/bar remote file system on client foo, confusion results when he saves files.
The server bar thinks the files belong to user jane, because jane is UID 200 on bar. If john logs on directly to bar by using the rlogin command, he may not be able to access the files he just created while working on the remotely mounted file system. jane, however, is able to do so because the machines arbitrate permissions by UID, not by name.
The only permanent solution to this is to reassign consistent UIDs on the two machines. For example, give john UID 200 on server bar or 250 on client foo. The files owned by john would then need to have the chown command run against them to make them match the new ID on the appropriate machine.
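Mismatches of this kind can be found before they cause trouble by comparing the two /etc/passwd files. The following Python sketch is one way to do that, assuming the usual colon-separated passwd format with the UID in the third field. The function names are hypothetical, and the sample entries reproduce the john and jane example with made-up home directories and shells.

```python
def parse_passwd(text):
    """Map user name to UID from passwd-format text (name:password:uid:...)."""
    uids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) >= 3 and fields[2].isdigit():
            uids[fields[0]] = int(fields[2])
    return uids


def uid_conflicts(client_passwd, server_passwd):
    """Return (user, client_uid, server_uid, server_owner_of_client_uid)."""
    client = parse_passwd(client_passwd)
    server = parse_passwd(server_passwd)
    server_by_uid = {uid: name for name, uid in server.items()}
    conflicts = []
    for name, uid in sorted(client.items()):
        if name in server and server[name] != uid:
            conflicts.append((name, uid, server[name], server_by_uid.get(uid)))
    return conflicts


CLIENT_FOO = "john:!:200:1::/home/john:/usr/bin/ksh\n"
SERVER_BAR = (
    "john:!:250:1::/home/john:/usr/bin/ksh\n"
    "jane:!:200:1::/home/jane:/usr/bin/ksh\n"
)

for user, client_uid, server_uid, owner in uid_conflicts(CLIENT_FOO, SERVER_BAR):
    print(user, client_uid, server_uid, owner)
```

Each reported tuple names a user whose UID differs between the two machines, together with the user on the server (if any) who would appear to own files that the client user creates.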
Because of the problems with maintaining consistent UID and GID mappings on all machines in a network, NIS or NIS+ is often used to perform the appropriate mappings so that this type of problem is avoided. See AIX Version 4.3 Network Information Services (NIS and NIS+) Guide for more information.
When an NFS server services a mount request, it looks up the name of the client making the request. The server takes the client Internet Protocol (IP) address and looks up the corresponding host name that matches that address. Once the host name has been found, the server looks at the exports list for the requested directory and checks the existence of the client's name in the access list for the directory. If an entry exists for the client and the entry matches exactly what was returned for the name resolution, then that part of the mount authentication passes.
If the server is not able to perform the IP address-to-host-name resolution, the server denies the mount request. The server must be able to find some match for the client IP address making the mount request. If the directory is exported with the access being to all clients, the server still must be able to do the reverse name lookup to allow the mount request.
The server also must be able to look up the correct name for the client. For example, suppose the /etc/exports file contains an entry like the following:
/tmp -access=silly:funny
and the /etc/hosts file contains the following corresponding entries:
150.102.23.21 silly.domain.name.com
150.102.23.52 funny.domain.name.com
Notice that the names do not correspond exactly. When the server looks up the IP address-to-host-name matches for the hosts silly and funny, the string names do not match exactly with the entries in the access list of the export. This type of name resolution problem usually occurs when using the named daemon for name resolution. Most named daemon databases have aliases for the full domain names of hosts so that users do not have to enter full names when referring to hosts. Even though these host-name-to-IP address entries exist for the aliases, the reverse lookup may not exist. The database for reverse name lookup (IP address to host name) usually has entries containing the IP address and the full domain name (not the alias) of that host. Sometimes the export entries are created with the shorter alias name, causing problems when clients try to mount.
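The exact-match requirement can be illustrated with a short Python sketch. It models only the string comparison described here, not the full behavior of the mountd daemon, and the function names are hypothetical.

```python
def parse_access_list(export_line):
    """Return the host names in the -access= option of an exports entry."""
    hosts = []
    fields = export_line.split()
    for field in fields[1:]:  # fields[0] is the exported directory
        for option in field.lstrip("-").split(","):
            if option.startswith("access="):
                hosts.extend(option[len("access="):].split(":"))
    return hosts


def mount_allowed(export_line, reverse_lookup_name):
    """True only if the reverse-lookup name matches an entry exactly."""
    return reverse_lookup_name in parse_access_list(export_line)


entry = "/tmp -access=silly:funny"
# The reverse lookup returns the full domain name, so the short alias fails.
print(mount_allowed(entry, "silly.domain.name.com"))
print(mount_allowed("/tmp -access=silly.domain.name.com:funny",
                    "silly.domain.name.com"))
```

The first check fails and the second succeeds, which is why listing the full domain name in the export entry (or making the reverse lookup return the short name) resolves this kind of mount failure.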
On systems that use NFS Version 3.2, a user cannot be a member of more than 16 groups without complications. (The groups command lists the groups to which a user belongs.) If a user is a member of 17 or more groups and tries to access files owned by the 17th (or greater) group, the system does not allow the file to be read or copied. To permit the user access to the files, rearrange the group order so that the group that owns the files is among the first 16.
When mounting a file system from a pre-Version 3 NFS server onto a Version 3 NFS client, a problem occurs when the user on the client executing the mount is a member of more than eight groups. Some servers are not able to deal correctly with this situation and deny the request for the mount. The solution is to reduce the user's group membership to eight or fewer groups and then retry the mount. The following error message is characteristic of this group problem:
RPC: Authentication error; why=Invalid client credential
Some NFS commands do not execute correctly if the NFS kernel extension is not loaded. Commands with this dependency include nfsstat, exportfs, mountd, nfsd, and biod. When NFS is installed on the system, the kernel extension is placed in the /usr/lib/drivers/nfs.ext file. This file is then loaded as the NFS kernel extension when the system is configured. The script that loads the kernel extension is the /etc/rc.net file; loading the NFS kernel extension is one of many tasks that this script performs. Note that the Transmission Control Protocol/Internet Protocol (TCP/IP) kernel extension should be loaded before the NFS kernel extension is loaded.
Note: The gfsinstall command is used to load the NFS kernel extension into the kernel when the system initially starts. This command can be run more than once per system boot and it will not cause a problem. The system is currently shipped with the gfsinstall command used in both the /etc/rc.net and /etc/rc.nfs files. This is correct. There is no need to remove either of these calls.