System Management Concepts: Operating System and Devices

Understanding Data Compression

The journaled file system (JFS) supports fragmented and compressed file systems. Both types of file systems save disk space by allowing a logical block to be stored on the disk in units, or "fragments," smaller than the full block size of 4096 bytes. In a fragmented file system, only the last logical block of a file no larger than 32KB is stored in this manner, so fragment support benefits only file systems that contain numerous small files. Data compression, however, allows every logical block of a file of any size to be stored as one or more contiguous fragments. On average, data compression reduces disk space usage by about a factor of two.

The use of fragments and data compression does, however, increase the potential for fragmentation of the disk's free space. Fragments allocated to a logical block must be contiguous on the disk. A file system experiencing free space fragmentation may have difficulty locating enough contiguous fragments for a logical block's allocation, even though the total number of free fragments exceeds the logical block's requirements. The JFS alleviates free space fragmentation by providing the defragfs utility, which "defragments" a file system by increasing the amount of contiguous free space. This utility can be used on both fragmented and compressed file systems. The disk space savings gained from fragments and data compression can be substantial, while the problem of free space fragmentation remains manageable.
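The contiguity requirement described above can be illustrated with a small sketch. It is not JFS code: the free map, its layout, and the function name find_contiguous_run are hypothetical and exist only to show how an allocation can fail even though enough free fragments exist in total.

```python
def find_contiguous_run(free_map, needed):
    """Return the index of the first run of `needed` free fragments, or None.

    free_map is a hypothetical list of booleans: True means the fragment
    is free, False means it is allocated.
    """
    run_start = None
    run_len = 0
    for i, is_free in enumerate(free_map):
        if is_free:
            if run_len == 0:
                run_start = i          # a new run of free fragments begins here
            run_len += 1
            if run_len == needed:
                return run_start
        else:
            run_len = 0                # an allocated fragment breaks the run
    return None


# Six fragments are free in total, but no four of them are adjacent.
free_map = [True, False, True, True, False, True, True, False, True]
total_free = sum(free_map)

print(total_free)                        # 6
print(find_contiguous_run(free_map, 4))  # None: the request cannot be satisfied
print(find_contiguous_run(free_map, 2))  # 2: a two-fragment run starts at index 2
```

A defragmentation pass, in this simplified picture, would rearrange allocated fragments so that the free ones form longer runs.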

Data compression in the current JFS is compatible with previous versions of AIX. The application programming interface (API), comprising all the system calls, remains the same in both versions of the JFS.

For more information on fragment support, disk utilization, free space fragmentation, and the performance costs associated with fragments, refer to "Understanding Fragments and a Variable Number of I-Nodes".

Data Compression Implementation

Attention: The root file system (/) must not be compressed.

Attention: Compressing the /usr file system is not recommended because installp must be able to do accurate size calculations for updates and new installs. See the Implicit Behavior section below for more information on size calculations.

Data compression is an attribute of a file system that is specified when the file system is created with the crfs or mkfs command. Compression applies only to regular files and long symbolic links in such file systems. Fragment support continues to apply to directories and meta-data, which are not compressed. Each logical block of a file is compressed by itself before being written to the disk. Compressing each block independently facilitates random seeks and updates, while giving up only a small amount of the disk space savings that compressing data in larger units would achieve.

Once compressed, a logical block usually requires less than 4096 bytes of disk space. The compressed logical block is written to the disk and allocated only the number of contiguous fragments required for its storage. If a logical block does not compress, then it is written to disk in its uncompressed form and allocated 4096 bytes of contiguous fragments.
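The allocation rule above amounts to rounding the compressed size up to a whole number of fragments. The following is a minimal sketch of that arithmetic, not the JFS implementation; the function name allocated_bytes and the fragment sizes used in the examples are assumptions chosen for illustration.

```python
LOGICAL_BLOCK_SIZE = 4096  # full logical block size, in bytes


def allocated_bytes(compressed_size, fragment_size):
    """Disk space, in bytes, allocated to one logical block.

    compressed_size: size of the block after compression (assumed > 0).
    fragment_size:   fragment size of the file system (assumed to divide 4096).
    """
    if compressed_size >= LOGICAL_BLOCK_SIZE:
        # The block did not compress: it is stored uncompressed in 4096 bytes.
        return LOGICAL_BLOCK_SIZE
    fragments = -(-compressed_size // fragment_size)  # ceiling division
    return min(fragments * fragment_size, LOGICAL_BLOCK_SIZE)


print(allocated_bytes(1500, 512))   # 1536 (three 512-byte fragments)
print(allocated_bytes(1500, 2048))  # 2048 (one 2048-byte fragment)
print(allocated_bytes(5000, 512))   # 4096 (block did not compress)
```

The same compressed block occupies less space with a smaller fragment size, because less is lost to rounding.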

Implicit Behavior

Since a program that writes a file does not expect an out-of-space (ENOSPC) condition to occur after a successful write (or successful store for mapped files), it is necessary to guarantee that space be available when logical blocks are written to the disk. This is accomplished by allocating 4096 bytes to a logical block when it is first modified so that there will be disk space available even if the block does not compress. If a 4096-byte allocation is not available, the system returns an ENOSPC or EDQUOT error condition even though there may be enough disk space to accommodate the compressed logical block. Premature reporting of an out-of-space condition is most likely when operating near disk quota limits or with a nearly full file system.
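The premature out-of-space condition can be illustrated with a toy model. This is a sketch under simplifying assumptions, not JFS code: the class ToyCompressedFS and its methods are hypothetical, free space is tracked as a single byte count, and the contiguity of fragments is ignored. It also assumes that the unused part of the 4096-byte allocation is returned to the free pool once the compressed block has been written.

```python
import errno

LOGICAL_BLOCK_SIZE = 4096


class ToyCompressedFS:
    """Hypothetical model of full-block reservation on first modification."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes
        self.allocated = {}  # block id -> bytes currently allocated to it

    def modify_block(self, block_id):
        """Reserve a full 4096 bytes before the block may be modified."""
        extra = LOGICAL_BLOCK_SIZE - self.allocated.get(block_id, 0)
        if extra > self.free_bytes:
            raise OSError(errno.ENOSPC, "no room for a full-block allocation")
        self.free_bytes -= extra
        self.allocated[block_id] = LOGICAL_BLOCK_SIZE

    def write_back(self, block_id, stored_bytes):
        """Block reaches the disk: keep only the space its fragments need."""
        self.free_bytes += self.allocated[block_id] - stored_bytes
        self.allocated[block_id] = stored_bytes


fs = ToyCompressedFS(free_bytes=6144)
fs.modify_block("a")              # reserves 4096 bytes, leaving 2048 free
try:
    fs.modify_block("b")          # needs 4096 bytes, only 2048 are free
except OSError as err:
    print(err.errno == errno.ENOSPC)  # True, although "b" might compress to fit
fs.write_back("a", 2048)          # "a" compressed to 2048 bytes; 4096 free again
fs.modify_block("b")              # now succeeds
```

In this model the error is reported while two compressed blocks of 2048 bytes each would have fit in the available space, which is the behavior the paragraph above describes.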

In addition to incurring a premature out-of-space error, compressed file systems may exhibit the following behavior:

Specifying Compression

The crfs, mkfs, and lsfs commands have been extended for data compression. These commands as well as the System Management Interface Tool (SMIT) now contain options for specifying or identifying data compression.

Identifying Data Compression

The -q option of the lsfs command displays the current value for compression.

Compatibility and Migration

Previous versions of AIX are compatible with the current JFS. Disk image compatibility is maintained with previous versions of AIX, so that file systems can be mounted and accessed without requiring disk migration activities or losing file system performance.

Backup/Restore

Backup and restore sequences can be performed from compressed to noncompressed file systems, or between compressed file systems with different fragment sizes. However, because compressed file systems use disk space more efficiently, a restore operation may fail for lack of disk space. This is of particular concern for full file system backup and restore sequences, and can occur even when the target file system is larger than the source file system.

Compression Algorithm

The compression algorithm is an IBM version of LZ. In general, LZ algorithms compress data by representing the second and later occurrences of a given string with a pointer that identifies the location of the string's first occurrence and its length. At the beginning of the compression process, no strings have been identified, so at least the first byte of data must be represented as a "raw" character requiring 9 bits (0,byte). Once a given amount of data, say N bytes, has been compressed, the compressor searches for the longest string in those N bytes that matches the string starting at the next unprocessed byte. If the longest match has length 0 or 1, the next byte is encoded as a "raw" character. Otherwise, the string is represented as a (pointer,length) pair and the number of bytes processed is incremented by length. Architecturally, IBM LZ supports values of N of 512, 1024, or 2048. IBM LZ specifies the encoding of (pointer,length) pairs and of raw characters. The pointer is a fixed-length field of size log2 N, while the length is encoded as a variable-length field.
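The general LZ scheme described above can be sketched as follows. This is a generic, greedy illustration and not the IBM LZ format: it produces symbolic tokens rather than a bit stream, it expresses the pointer as a distance back from the current position, and it restricts a match to bytes that have already been processed. The function names lz_tokens and lz_expand are hypothetical.

```python
def lz_tokens(data, window=512):
    """Turn bytes into ('raw', byte) and ('copy', distance, length) tokens.

    window plays the role of N: only the last `window` processed bytes
    are searched for a match.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        for cand in range(max(0, pos - window), pos):
            length = 0
            # Extend the match while it stays inside already-processed data.
            while (pos + length < len(data)
                   and cand + length < pos
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len < 2:
            # A match of length 0 or 1 is not worth a pair: emit a raw byte.
            tokens.append(("raw", data[pos]))
            pos += 1
        else:
            tokens.append(("copy", best_dist, best_len))
            pos += best_len
    return tokens


def lz_expand(tokens):
    """Rebuild the original bytes from the tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "raw":
            out.append(tok[1])
        else:
            _, dist, length = tok
            start = len(out) - dist
            out.extend(out[start:start + length])
    return bytes(out)


sample = b"abcabcabcd"
tokens = lz_tokens(sample)
print(tokens)
assert lz_expand(tokens) == sample
```

For the sample input, the first three bytes and the final byte come out as raw characters and the two repetitions of "abc" come out as pairs. How such tokens are packed into bits, including the variable-length encoding of the length field, is defined by IBM LZ and is not reproduced here.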

Performance Costs

Because data compression is an extension of fragment support, the performance costs associated with fragments also apply to data compression. Compressed file systems also affect performance in the following ways:

