AIX National Language Support (NLS) facilitates the use of AIX in various language environments. Because informed use of NLS is increasingly important in obtaining optimum performance from the system, a brief review of NLS is in order.
NLS allows AIX to be tailored to the individual user's language and cultural expectations. A locale is a specific combination of language and geographic or cultural requirements that is identified by a compound name, such as en_US (English as used in the United States). For each supported locale, there is a set of message catalogs, collation value tables, and other information that defines the requirements of that locale. When AIX is installed, the system administrator can choose what locale information should be installed. Thereafter, the individual users can control the locale of each shell by changing the LANG and LC_ALL variables.
The one locale that does not conform to the structure just described is the C (or POSIX) locale. The C locale is the system default locale unless the user explicitly chooses another. It is also the locale in which every program begins execution, and in which it remains until it calls setlocale. Running in the C locale is the nearest equivalent in AIX to running in the original, unilingual form of UNIX. There are no C message catalogs. Instead, programs that attempt to get a message from the catalog are given back the default message that is compiled into the program. Some commands, such as the sort command, revert to their original, character-set-specific algorithms.
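The fallback to a compiled-in message can be sketched with the standard catopen and catgets functions. This is only an illustration: the helper name get_message and the catalog name passed to it are hypothetical, not part of AIX.

```c
#include <nl_types.h>
#include <stdio.h>

/* Hypothetical helper: copy message (set_id, msg_id) from the named
   catalog into out. If the catalog cannot be opened, or the message
   is missing, the compiled-in fallback text is used instead. */
void get_message(const char *catalog, int set_id, int msg_id,
                 const char *fallback, char *out, size_t outsz)
{
    const char *msg = fallback;
    nl_catd cat = catopen(catalog, NL_CAT_LOCALE);

    if (cat != (nl_catd)-1) {
        /* catgets returns the fallback pointer if the lookup fails */
        msg = catgets(cat, set_id, msg_id, fallback);
    }
    snprintf(out, outsz, "%s", msg);   /* copy before closing the catalog */
    if (cat != (nl_catd)-1) {
        catclose(cat);
    }
}
```

When no catalog is found for the current locale, as is the case in the C locale, the caller simply receives the default text it supplied.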
Our measurements show that the performance of NLS falls into three bands. The C locale is generally the fastest for the execution of commands, followed by the single-byte (Latin alphabet) locales such as en_US, with the multibyte locales resulting in the slowest command execution.
Historically, the C language has displayed a certain amount of provinciality in its interchangeable use of the words byte and character. Thus, an array declared char foo[10] is an array of 10 bytes. But not all of the languages in the world are written with characters that can be expressed in a single byte. Japanese and Chinese, for example, require two or more bytes to identify a particular graphic to be displayed. Therefore, in AIX we distinguish between a byte, which is 8 bits of data, and a character, which is the amount of information needed to represent a single graphic.
Two characteristics of each locale are the maximum number of bytes required to express a character in that locale and the maximum number of output display positions a single character can occupy. These values can be obtained with the MB_CUR_MAX and MAX_DISP_WIDTH macros. If both values are 1, the locale is one in which the equivalence of byte and character still holds. If either value is greater than 1, programs that do character-by-character processing, or that keep track of the number of display positions used, will need to use internationalization functions to do so.
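As a rough sketch, a program could test the byte-per-character limit as follows. The function names are hypothetical. Only MB_CUR_MAX is checked here, because it is standard C; MAX_DISP_WIDTH is specific to AIX and would need a similar test on that system.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: return the maximum number of bytes per character
   in the named locale, or 0 if that locale is not installed.
   Side effect: on success the process-wide locale is changed. */
int max_bytes_per_char(const char *locale_name)
{
    if (setlocale(LC_ALL, locale_name) == NULL) {
        return 0;
    }
    return (int)MB_CUR_MAX;
}

/* Nonzero when one byte is always one character in the current locale.
   Display width is not examined by this sketch. */
int is_single_byte_locale(void)
{
    return MB_CUR_MAX == 1;
}
```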
Since the multibyte encodings consist of variable numbers of bytes per character, they cannot be processed as arrays of characters. To allow efficient coding in situations where each character has to receive extensive processing, a fixed-byte-width data type, wchar_t, has been defined. A wchar_t is wide enough to contain a translated form of any supported character encoding. Programmers can therefore declare arrays of wchar_t and process them with (roughly) the same logic they would have used on an array of char, using the wide-character analogs of the traditional libc functions. Unfortunately, the translation from the multibyte form in which text is entered, stored on disk, or written to the display, to the wchar_t form, is computationally quite expensive. It should only be performed in situations in which the processing efficiency of the wchar_t form will more than compensate for the cost of translation to and from the wchar_t form.
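The translation step might look like the following sketch, which uses the standard mbstowcs function. The helper name to_wide is hypothetical, and passing a null destination to obtain the character count relies on POSIX behavior rather than on the C standard alone.

```c
#include <stdlib.h>

/* Hypothetical helper: convert a multibyte string to a newly allocated
   wchar_t array, interpreting it in the current locale. Returns NULL on
   an invalid byte sequence or allocation failure. The caller frees the
   result. If nchars is not NULL it receives the number of characters. */
wchar_t *to_wide(const char *mbs, size_t *nchars)
{
    /* First pass: count characters without storing them (POSIX). */
    size_t n = mbstowcs(NULL, mbs, 0);
    if (n == (size_t)-1) {
        return NULL;
    }

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL) {
        return NULL;
    }

    /* Second pass: translate, including the terminating null. */
    mbstowcs(w, mbs, n + 1);
    if (nchars != NULL) {
        *nchars = n;
    }
    return w;
}
```

The sketch scans the input twice and allocates memory, which gives some idea of why the translation is worth performing only when the wchar_t form will be processed heavily.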
A programmer who is unaware of certain constraints on the design of multibyte character sets can easily write a slow multilingual application. These constraints allow many programs to run efficiently in a multibyte locale with little use of internationalization functions. For example:
Use the standard strcmp function to compare strings for equality. When we write
if (strcmp(foostr,"a rose") == 0)
we are not looking for "a rose" by any other name; we are looking for that set of bits only. If foostr contains "a rosE" we are not interested.
When we are arranging strings in the collation sequence defined by the locale, we would use
if (strcoll(foostr,barstr) > 0)
and pay the performance cost of obtaining the collation information about each character.
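A program always starts in the C locale. If it will use one or more internationalization functions, it must execute
setlocale(LC_ALL, "");
to switch to the locale of its parent process before calling any internationalization function.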
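The two kinds of comparison can be put side by side in a short sketch. The function names are hypothetical. The sorting routine assumes that the caller has already executed setlocale(LC_ALL, ""); without that call the program is still in the C locale and strcoll orders strings the same way strcmp does.

```c
#include <stdlib.h>
#include <string.h>

/* Equality: a byte-for-byte match, independent of the locale. */
int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}

/* qsort comparator: orders strings by the current locale's collation. */
static int by_collation(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);
}

/* Sort an array of n strings into the locale-defined collation order.
   The collation cost is paid on every comparison qsort performs. */
void sort_by_collation(const char **v, size_t n)
{
    qsort(v, n, sizeof v[0], by_collation);
}
```

Reserving strcoll for true ordering operations, and using strcmp everywhere else, keeps the collation cost out of the common case.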
The command sequence:
LANG=C; export LANG
sets the default locale to C (that is, C is used unless a given variable, such as LC_COLLATE, is explicitly set to something else).
The sequence:
LC_ALL=C; export LC_ALL
forcibly sets all the locale variables to C, regardless of previous settings.
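The precedence among these variables can be sketched in C. This is an illustration of the POSIX lookup order only, not the code the system actually uses, and the function names are hypothetical. An empty value is treated the same as an unset variable.

```c
#include <stdlib.h>

/* getenv() that treats an empty value the same as an unset variable. */
static const char *env_nonempty(const char *name)
{
    const char *v = getenv(name);
    return (v != NULL && v[0] != '\0') ? v : NULL;
}

/* Hypothetical helper: name of the locale that applies to one category
   variable, for example "LC_COLLATE". LC_ALL wins, then the category's
   own variable, then LANG, and finally the C locale. */
const char *effective_locale(const char *category_var)
{
    const char *v;

    if ((v = env_nonempty("LC_ALL")) != NULL) {
        return v;
    }
    if ((v = env_nonempty(category_var)) != NULL) {
        return v;
    }
    if ((v = env_nonempty("LANG")) != NULL) {
        return v;
    }
    return "C";
}
```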
For a report on the current settings of the locale variables, type locale.