Document Id: OALWP01950320
Date Loaded: 03-21-95
Description: HP-UX 10.0 HFS File System White Paper
HP-UX 10.0 HFS File System White Paper
HP 9000 Series 700/800 Computers
March 1995, First Edition
LEGAL NOTICES
The information in this document is subject to change without notice.
Hewlett-Packard makes no warranty of any kind with regard to this
manual, including, but not limited to, the implied warranties of
merchantability and fitness for a particular purpose.
Hewlett-Packard shall not be held liable for errors contained herein
or direct, indirect, special, incidental or consequential damages in
connection with the furnishing, performance, or use of this material.
Warranty. A copy of the specific warranty terms applicable to
your Hewlett-Packard product and replacement parts can be obtained
from your local Sales and Service Office.
Restricted Rights Legend. Use, duplication, or disclosure by
the U.S. Government Department is subject to restrictions as set
forth in subparagraph (c) (1) (ii) of the Rights in Technical Data and
Computer Software clause at DFARS 252.227-7013 for DOD agencies, and
subparagraphs (c) (1) and (c) (2) of the Commercial Computer Software
Restricted Rights clause at FAR 52.227-19 for other agencies.
HEWLETT-PACKARD COMPANY
3000 Hanover Street
Palo Alto, California 94304 U.S.A.
Use of this manual and flexible disk(s) or tape cartridge(s) supplied
for this pack is restricted to this product only. Additional copies
of the programs may be made for security and back-up purposes only.
Resale of the programs in their present form or with alterations, is
expressly prohibited.
Copyright Notices. (C)copyright 1983-95 Hewlett-Packard Company,
all rights reserved.
Reproduction, adaptation, or translation of this document without prior
written permission is prohibited, except as allowed under the copyright laws.
(C)copyright 1979, 1980, 1983, 1985-93 Regents of the University
of California
This software is based in part on the Fourth Berkeley Software
Distribution under license from the Regents of the University of
California.
(C)copyright 1980, 1984, 1986 Novell, Inc.
(C)copyright 1986-1992 Sun Microsystems, Inc.
(C)copyright 1985-86, 1988 Massachusetts Institute of Technology.
(C)copyright 1989-93 The Open Software Foundation, Inc.
(C)copyright 1986 Digital Equipment Corporation.
(C)copyright 1990 Motorola, Inc.
(C)copyright 1990, 1991, 1992 Cornell University
(C)copyright 1989-1991 The University of Maryland.
(C)copyright 1988 Carnegie Mellon University.
Trademark Notices. UNIX is a registered trademark in the United States and
other countries, licensed exclusively through X/Open Company Limited.
X Window System is a trademark of the Massachusetts Institute of
Technology.
MS-DOS and Microsoft are U.S. registered trademarks of Microsoft
Corporation.
OSF/Motif is a trademark of the Open Software Foundation, Inc. in the
U.S. and other countries.
First Edition: March 1995 (HP-UX Release 10.0)
==============================================================================
HP-UX 10.0 HFS File System
==========================
The predominant file system used by HP-UX is called the High Performance
File System (HFS), which is also known as the McKusick (or BSD) file
system. This white paper describes the structure of the file system
and its relationship to the disks on which file systems reside.
The following additional resources are useful in gaining further understanding
of the HP-UX file systems and how to administer them:
* HP-UX System Administration Tasks manual, for creating
and managing file systems and disk space.
* Section (4) of the HP-UX Reference, for specifications of file-system
formats.
* HP-UX 10.0 Documentation Map, identifying additional
sources of information.
* Other white papers on file-system subjects, available from SupportLine.
To work effectively with file systems, you must understand their
interrelationship with physical disks. Every file of the HFS file
system is stored on a formatted mass storage medium, a disk. The disk
is known to HP-UX by specifying the path name to the disk's device file.
Device drivers in the operating system enable communication to the disk.
Each architecture supports a different set of disks, based on the device
drivers written for that architecture and disk. To access files in a
file system, you mount the file system that resides on a disk by
associating the path name of its mount point with the disk's device file.
Once mounted, the file system is accessible to the operating system and
users.
This paper discusses file-system creation, storage, modification, and
protection.
Understanding File-System Creation
==================================
As a system administrator, much of what you do concerns file systems.
System files, application files, and user files are typically organized
as file systems. Also, although disks are the storage devices that hold
data, the data must reside in a file system to be available to the
operating system. Thus, if you run short of space, you can install a
new disk and create a file system on it to hold additional data.
Conceptually, creating a file system involves:
* making the physical environment (the disk device) available to
the file system.
* creating the software entity (the file system) itself.
* establishing (by mounting) the "connective threads" between the
physical and software elements.
HP-UX uses the term "file system" to mean two things. In the first
sense, the file system is the entire HP-UX directory tree, starting from
root (often several file systems mounted together). In the second sense,
a file system is the body of structures on each file-system device that
enables you to keep data contiguous with the existing data hierarchy.
This second meaning is the subject of this white paper. This section
summarizes the numerous aspects of file system creation, to explain how
a file system is connected to HP-UX as a whole.
Note:
All procedures for creating and maintaining file systems are
found in the HP-UX System Administration Tasks manual.
There are many reasons why you might add a new file system, including:
* You anticipate that your file system will soon exceed
current maximum capacity.
* Your current file system has already reached maximum capacity.
* You wish to separate portions of a file system physically, to
restrict growth of files on a portion of the file system or
to increase concurrent access for better performance.
To create a file system, you can use a sequence of HP-UX commands, or
you can invoke the SAM utility and perform the task interactively. In
either case, adding a file system involves:
* Installing the necessary device files for the new device
(done if disk is newly connected)
* Preparing the storage medium (the disk device) for the file system
(if disk is newly connected)
* Creating the file system itself.
* Mounting the file system to make it available for system use.
* Adding the file system to /etc/fstab for automatic mounting.
If you are creating your new file system on a new disk drive, you first
connect the physical device to the system, referring to the device's
installation manual. Use a hard disk to hold an HP-UX file system.
Flexible disks and cartridge and reel tape drives are too limited in
capacity, too slow, and too subject to deterioration under such constant
use.
Rewritable magneto-optical disks are slower than hard disks, but
substantially faster than flexible disks or tape, and are typically used
to back up a file system. If necessary, magneto-optical disks can be
used to hold an auxiliary file system.
Each disk is accessed physically via a compatible interface card that
connects the disk to the computer's bus architecture. Hard disk drives
might use any of the following interfaces -- standard or high-speed
HP-IB, fiber link (HP-FL), or small computer systems interface (SCSI).
The protocol for each interface is encoded in a specific device driver,
which must be present for HP-UX to communicate with the disk.
The operating system accesses physical devices logically through both
the device driver and device special files.
* You can see the device drivers used by your system by reading
the /stand/system file or by running the lsdev(1M) command.
* You can see device special files for disks by listing the
/dev/dsk (for block special files) and /dev/rdsk (for character
special files) directories.
Create device files using mknod(1M), mksf(1M), or insf(1M). Character and
block device special files are required for devices that hold file systems.
If you are apportioning disk space using the Logical Volume Manager (LVM),
you need a character and block device special file for each logical volume.
Without LVM on Series 700 systems, you need a character and block device
special file for the entire disk drive. Using disk sections on a Series 800,
you need a character and block device special file for each section used.
The device special files are used when performing system
administration tasks involving the file system. For example,
* The mediainit(1M) command requires a transparent special file
to reformat a disk or tape for a file system. Use mediainit
if you suspect the media is corrupted or worn. To use mediainit,
you must create the device files using the -t option of mksf(1M).
* The mount(1M) command requires block device files to mount and
unmount (umount) the file system.
* The newfs(1M) command requires a character special file to
create a file system.
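The pairing of block and character special files described above follows a naming pattern that can be sketched in Python. The helper name raw_device_for is hypothetical. The /dev/dsk to /dev/rdsk mapping comes from the directories named earlier; the lvolN to rlvolN mapping for logical volumes is an assumption about common HP-UX naming and should be checked on your own system.

```python
import posixpath

def raw_device_for(block_dev):
    """Return the character (raw) device path paired with a block device path.

    Assumed naming rules (illustrative, not read from the system):
      /dev/dsk/<name>   -> /dev/rdsk/<name>
      /dev/<vg>/<lvol>  -> /dev/<vg>/r<lvol>
    """
    directory, name = posixpath.split(block_dev)
    if directory == "/dev/dsk":
        return posixpath.join("/dev/rdsk", name)
    # Otherwise assume an LVM logical volume: prefix the volume name with "r".
    return posixpath.join(directory, "r" + name)

print(raw_device_for("/dev/dsk/c1t4d0"))   # block file used by mount
print(raw_device_for("/dev/vg00/lvol4"))   # character file used by newfs
```

A command such as newfs would be given the returned character path, while mount would be given the original block path.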
HP-UX cannot use media to store files until you place a file system on
it. You can create a new file system using SAM, mkfs(1M), or newfs(1M).
Of the two manual commands, newfs is easier to use. When you create a
file system, you create an environment to contain files, much like
building a "file cabinet" for paper files. When first built, the file
cabinet is empty. Then you add files.
To create a file system, you specify the disk special file to newfs;
newfs queries the device driver, which returns information that
newfs can then use to set disk characteristics and key values, including block
and fragment size, number of bytes per inode, percentage of reserved free
space, and rotational delay.
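To give a feel for how two of these key values interact, the following Python sketch estimates the inode count and the space left for users from a file-system size. It is a rough model only: the function name and all input values are hypothetical, and the arithmetic ignores the space taken by superblocks and cylinder group structures.

```python
def estimate_layout(fs_kbytes, bytes_per_inode, minfree_percent):
    """Estimate inode count and user-available space for a new file system.

    All three inputs are hypothetical tuning values; newfs chooses its own
    defaults from what the device driver reports.
    """
    total_bytes = fs_kbytes * 1024
    inodes = total_bytes // bytes_per_inode           # one inode per N bytes of space
    reserved_kb = fs_kbytes * minfree_percent // 100  # held back as reserved free space
    return inodes, fs_kbytes - reserved_kb

inodes, usable_kb = estimate_layout(fs_kbytes=100000, bytes_per_inode=2048,
                                    minfree_percent=10)
print(inodes, usable_kb)
```

A smaller bytes-per-inode value yields more inodes (useful for many small files) at the cost of space for the inode tables.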
Procedures for building a file system are documented
in the HP-UX System Administration Tasks manual, Chapter 4.
After creating a file system, the file system has to be mounted
(attached) to the HP-UX file hierarchy, using the mount(1M) command.
This incorporates the file system into the existing file system's
overall hierarchy. You do this by logically associating the root
directory of the new file system with a mount point, a directory on the
existing file system. Once a file system is mounted, the mount points
are seamless. You can access the new mounted disk space as a contiguous
part of the entire HP-UX file-system hierarchy, as shown in the
following figure.
File System /users Mounted to Root File System /dev/dsk/c1t4d0 at /home
                  +--------------------------------------+
                  |                 /                    |
                  |                 |                    |
                  |  +--------------+---------------+    |
root file system  |  |              |               |    |
/dev/dsk/c1t4d0   | bin            usr             home  |
                  |                                 |    |
                  +---------------------------------|----+
                                                    |
                              +---------------------|--------------------+
                              |                     |                    |
                              |      +---------+----+----+---------+     |
file system                   |      |         |         |         |     |
/users                        |     beth       jo       amy       meg    |
                              |                                          |
                              +------------------------------------------+
Once mounted, user jo's pathname is /home/jo, but when you run bdf,
you will see the file system /users mounted to /home.
To mount a file system:
* Make a mount point directory (using the mkdir command)
for the file system.
* Mount the newly created file system to the mount point
(using the mount command).
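The two steps above can be written down as an ordered command plan. The Python sketch below only builds the command strings and does not run them. The function name is hypothetical, and the device and mount point are borrowed from the sample bdf output elsewhere in this paper.

```python
def mount_plan(block_dev, mount_point):
    """Return the two commands, in order, that mount a file system.

    Only the plain 'mkdir <dir>' and 'mount <device> <dir>' forms are used;
    the commands are returned as strings, not executed.
    """
    return [
        "mkdir " + mount_point,                    # step 1: create the mount point
        "mount " + block_dev + " " + mount_point,  # step 2: attach the file system
    ]

for command in mount_plan("/dev/vg00/lvol4", "/home"):
    print(command)
```

Keeping the plan as data makes it easy to review before any command is actually issued by an administrator.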
An existing file system can be moved to a different location on the HP-UX
file hierarchy by unmounting (detaching) it from its current location using
the -u option (or umount command) of mount(1M) and remounting the file
system. A file system cannot be unmounted if any files are open or if any
user's current working directory is in that file system. You can use the
fuser(1M) command to identify which processes are using a file system or
file structure, and if necessary, terminate them. The shutdown command
unmounts all mounted file systems before bringing a system down, so that
the file systems are not corrupted.
You cannot unmount the root file system or any file system that has
dynamic swap enabled. Likewise, be sure that the /stand and /sbin
directories are part of the root file system, so that they cannot be
inadvertently unmounted. (Directories such as /var, /opt, and /usr are
made to be mountable.)
For mounting, you refer to the file system by its logical volume and its
mount point directory. For unmounting, you refer to the file system by
either the device file name or mount point, because unmounting breaks
the link between the two.
As a system administrator, you maintain the /etc/fstab file as a record
of mountable file systems and swap space. The /etc/fstab file is read:
* by /sbin/init.d/hfsmount, to mount all listed file systems when the
system boots up.
* by fsck(1M), to determine the order for conducting file-system checks.
* by shutdown(1M), to unmount all file systems before halting the system.
* by library calls such as getfsent(3X) and getmntent(3X),
which enable programs to make use of file system information.
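As an illustration of how a program might consume this file, the Python sketch below parses fstab-style text and orders the entries by pass number. The six-field layout (device, directory, type, options, backup frequency, pass number) is an assumption to be checked against fstab(4), and the sample entries are hypothetical.

```python
from collections import namedtuple

# Assumed field order for one /etc/fstab line; check fstab(4) on your system.
FstabEntry = namedtuple("FstabEntry", "device directory fstype options freq passno")

def parse_fstab(text):
    """Parse fstab-style text, skipping blank lines and '#' comments."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        dev, directory, fstype, options, freq, passno = line.split()[:6]
        entries.append(FstabEntry(dev, directory, fstype, options,
                                  int(freq), int(passno)))
    return entries

SAMPLE = """\
# hypothetical entries for illustration
/dev/vg00/lvol1 /     hfs defaults 0 1
/dev/vg00/lvol7 /usr  hfs defaults 0 2
/dev/vg00/lvol4 /home hfs defaults 0 2
"""

# Order the entries the way a checker might: lowest pass number first.
for entry in sorted(parse_fstab(SAMPLE), key=lambda e: e.passno):
    print(entry.passno, entry.device, entry.directory)
```

A line with fewer than six fields raises ValueError in this sketch; a production parser would need to decide how to treat short lines.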
Disk Layout
===========
The disk layout is the geometry applied to a physical disk. Typically,
a disk is divided into areas that accommodate file systems or raw I/O,
dump, and swap. A disk from which the system can be booted is called a
root disk; it is organized somewhat differently from other disks and is
discussed later in this paper. Non-root disks typically contain a single
swap area,
file systems, or a combination of both. The following sections discuss
layout principles of HP-UX disks for each architecture.
Logical Volume Manager (LVM) is the recommended method of apportioning disk
space on both Series 700 and Series 800.
Logical Volumes
_______________
* LVM enables you to partition disks flexibly. You combine one or
more disks (called physical volumes) into a volume group, which
can then be subdivided into logical volumes.
* The size of logical volumes can be defined according to need.
You can extend or reduce the size of a logical volume as needs change.
* Logical volumes can span disks. This enables you to create
very large logical volumes, or use small portions of disk
space more efficiently.
* You can mirror logical volumes, using an optional product,
MirrorDisk/UX.
Procedures for using Logical Volume Manager (LVM) are documented in
"HP-UX System Administration Tasks."
Note:
Software Disk Striping (SDS), which had been a Series 700-only
feature, is no longer supported on 10.0. Instead, you need to
convert the disk to 10.0 LVM. LVM provides comparable striping capability
for both Series 700 and 800, using the lvcreate command with -i and -I
options. See lvcreate(1M) and lvextend(1M) in the HP-UX Reference.
Series 700 Disk Layout
______________________
The first 8 KB of the Series 700 disk is used for the LIF directory,
which contains pointers to the file system and each boot program in
the boot area. In the absence of a boot area, the swap area occupies the
remaining space. The Series 700 boot program occupies the last 2 MB of the
root disk layout.
Series 700 Disk Layout
----------------------
Area                Data Structure              Size
----                --------------              ----
Boot pointers       LIF directory               8 KB
File system         Superblocks                 8 KB
and Dynamic Swap    (primary and redundant)
                    Cylinder group 1            varies
                    Cylinder group 2
                    ...
                    Cylinder group n
Swap                Swap tables                 0 or more blocks
                    (defined in
                    /usr/include/sys/swap.h)
Boot area           LIF file system             2 MB
(optional)
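The fixed sizes in this layout (8 KB of boot pointers at the start, 2 MB of boot area at the end of a root disk) make it possible to sketch where each area would fall. The following Python example is a simplified model: the function name, the disk size, and the swap size are hypothetical, and the actual boundaries are decided when the file system is created.

```python
LIF_DIRECTORY_KB = 8      # first 8 KB of the disk
BOOT_AREA_KB = 2 * 1024   # last 2 MB of a root disk

def series700_layout(disk_kb, swap_kb, bootable=True):
    """Return (area, start_kb, size_kb) tuples in on-disk order.

    disk_kb and swap_kb are hypothetical inputs; real sizes are set when the
    file system is created.
    """
    boot_kb = BOOT_AREA_KB if bootable else 0
    fs_kb = disk_kb - LIF_DIRECTORY_KB - swap_kb - boot_kb
    if fs_kb <= 0:
        raise ValueError("no room left for a file system")
    areas = [("LIF directory", LIF_DIRECTORY_KB), ("file system", fs_kb),
             ("swap", swap_kb)]
    if bootable:
        areas.append(("boot area", boot_kb))
    layout, start = [], 0
    for name, size in areas:
        layout.append((name, start, size))
        start += size
    return layout

for name, start, size in series700_layout(disk_kb=1000000, swap_kb=65536):
    print(f"{name:14s} start={start:>8d} KB size={size:>8d} KB")
```

With bootable=False the boot area is omitted and the swap area runs to the end of the disk, matching the "absence of a boot area" case described above.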
Series 800 Disk Layout
______________________
For backward compatibility, Series 800 disks can be apportioned in sections
(also called partitions). Using LVM is the recommended method, however, and
you are encouraged to convert your disks to LVM.
Disk space can be partitioned on the Series 800 in a variety of ways.
Each section can be addressed as if it were a separate disk drive.
A section can be used for:
* Boot area
* File system
* Swap area
* Raw I/O
The layout of each section is nearly identical to the same areas on
the Series 700. However, the boot and swap areas reside in their own
sections instead of residing in the same section as the file system.
Series 800 disks can be partitioned into sixteen possible section
choices. The size and location of each hard-coded section, as
shown below, is dependent on the disk model.
Disk Sections and Relative Sizes
--------------------------------
[Figure: relative sizes and overlaps of disk sections 0 through 15]
Limited information on section sizes and locations is defined in the
/etc/disktab file (maintained only for backward compatibility). If you are
managing a disk using hard-coded sections, when you create a new file system
(with mkfs, newfs or SAM), you declare the section on which the file system
is to be created. You must be careful not to use overlapping disk sections.
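Checking for overlapping sections amounts to an interval-overlap test. The Python sketch below shows the idea; the function name is hypothetical and the section extents are invented for illustration, not taken from any real disk model or from /etc/disktab.

```python
def overlapping_sections(sections, in_use):
    """Return pairs of in-use section numbers whose disk extents overlap.

    sections maps section number -> (start_kb, size_kb).
    """
    pairs = []
    chosen = sorted(in_use)
    for i, a in enumerate(chosen):
        a_start, a_size = sections[a]
        for b in chosen[i + 1:]:
            b_start, b_size = sections[b]
            # Two half-open ranges overlap when each starts before the other ends.
            if a_start < b_start + b_size and b_start < a_start + a_size:
                pairs.append((a, b))
    return pairs

# Hypothetical extents, NOT taken from any real disk model.
SECTIONS = {0: (0, 1000), 1: (0, 400), 2: (400, 600), 3: (200, 300)}

print(overlapping_sections(SECTIONS, [1, 2]))   # adjacent sections
print(overlapping_sections(SECTIONS, [1, 3]))   # section 3 starts inside section 1
```

An empty result means the chosen sections can be used together; any pair returned identifies two sections that share disk space.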
File System Size
================
HP-UX supports file systems up to 4 GB; however, the size limit for
individual files is 2 GB. Also, applications may not use raw access to
disk sections larger than 2 GB. For very large disks (such as HP C2254B),
the boot partition must lie within 2 GB of the beginning of the disk.
Protocols do not permit NFS-mounting file systems larger than 2 GB.
Disk and File System Tools
==========================
When working with file systems, you often have to understand how
much disk space you have and how large your file systems are.
/usr/sbin/diskinfo can help you determine available disk space. To see how
large a file system is, you can use bdf, df, or du. For backward
compatibility, /etc/disktab provides some information
about disk geometry. Each tool is discussed in the next sections.
Disk Characteristics Command -- /usr/sbin/diskinfo
__________________________________________________
The diskinfo(1M) command displays characteristics of a disk device,
when given the device's character special file. /usr/sbin/diskinfo
is particularly useful when setting up or managing logical volumes.
When used without options, /usr/sbin/diskinfo produces terse output:
% /usr/sbin/diskinfo /dev/rdsk/c2t5d0
SCSI describe of /dev/rdsk/c2t5d0:
vendor: HP
product id: C3010
type: direct access
size: 1956086 Kbytes
bytes per sector: 512
With the -b option, /usr/sbin/diskinfo returns the size of the disk in
1024-byte sectors.
% /usr/sbin/diskinfo -b /dev/rdsk/c2t5d0
1956086
The verbose (-v) option of /usr/sbin/diskinfo displays different
information, depending on type of disk:
* vendor and product ID (SCSI devices)
* device name (CS/80 and SCSI)
* number of bytes/sector (CS/80 and SCSI)
* geometry, interleave, and timing information (CS/80)
* size in bytes and logical blocks, revision level, SCSI
conformance level (SCSI)
For example,
% /usr/sbin/diskinfo -v /dev/rdsk/c2t5d0
SCSI describe of /dev/rdsk/c2t5d0:
vendor: HP
product id: C3010
type: direct access
size: 1956086 Kbytes
bytes per sector: 512
rev level: 0BQ3
blocks per disk: 3912172
ISO version: 0
ECMA version: 0
ANSI version: 2
removable media: no
response format: 2
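The size and block figures in this output are related by simple arithmetic: the size in Kbytes times 1024, divided by the bytes per sector, gives the blocks per disk. The Python sketch below parses a shortened copy of the sample output and checks that relationship; the parsing helper is hypothetical and assumes the "label: value" line format shown above.

```python
def parse_diskinfo(text):
    """Turn 'label: value' lines of diskinfo-style output into a dict."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            fields[label.strip()] = value.strip()
    return fields

SAMPLE = """\
SCSI describe of /dev/rdsk/c2t5d0:
vendor: HP
product id: C3010
size: 1956086 Kbytes
bytes per sector: 512
blocks per disk: 3912172
"""

info = parse_diskinfo(SAMPLE)
size_kb = int(info["size"].split()[0])
sector_bytes = int(info["bytes per sector"])
blocks = size_kb * 1024 // sector_bytes
print(blocks, blocks == int(info["blocks per disk"]))
```

The header line ends in a colon with no value, so the parser skips it.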
Free Disk Blocks Command -- bdf
_______________________________
The bdf command (Berkeley's variation of df) reports the number of
free disk blocks available on a file system. If no file system is
given as an argument, bdf reports on all file systems.
Several options are available:
-b Displays information about file system swapping.
-i Displays used and free inodes.
-l Local. Displays HFS file systems mounted on a client.
Does not display NFS-mounted file systems.
-t type Displays only information on mounted file systems of a given type.
Here is sample output of bdf:
% bdf
Filesystem           kbytes    used   avail %used Mounted on
/dev/vg00/lvol1       47829   19886   23160   46% /
/dev/vg00/lvol8       34541    8260   22826   27% /var
/dev/vg00/lvol7      299157  157561  111680   59% /usr
/dev/vg00/lvol6       23013    3576   17135   17% /tmp
/dev/vg00/lvol5       99669   11100   78602   12% /opt
/dev/vg00/lvol4       19861       9   17865    0% /home
bdf reports its output in 1024-byte blocks.
df reports its output in 512-byte blocks.
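The sample figures can be cross-checked with a little arithmetic. In the Python sketch below, the %used formula (used divided by used plus avail, rounded) and the interpretation of the remaining space as reserved free space are inferences from the sample numbers, not statements taken from the bdf documentation.

```python
ROWS = [  # (kbytes, used, avail, reported %used, mount point) from the sample
    (47829, 19886, 23160, 46, "/"),
    (34541, 8260, 22826, 27, "/var"),
    (299157, 157561, 111680, 59, "/usr"),
    (23013, 3576, 17135, 17, "/tmp"),
    (99669, 11100, 78602, 12, "/opt"),
    (19861, 9, 17865, 0, "/home"),
]

def percent_used(used, avail):
    """Percentage of user-available space in use, rounded to nearest integer."""
    return int(100.0 * used / (used + avail) + 0.5)

for kbytes, used, avail, reported, mount in ROWS:
    reserved = kbytes - used - avail            # space not counted as available
    print(mount, percent_used(used, avail), reported,
          round(100.0 * reserved / kbytes, 1),  # reserved share of the file system
          kbytes * 2)                           # same size in 512-byte blocks
```

The last column shows the conversion between the two commands' units: a size in 1024-byte blocks doubles when expressed in the 512-byte blocks that df uses.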
Disk Usage Command -- du
________________________
The du command reports disk usage in 512-byte blocks for all
files or directories specified; if none is specified, du reports on the
current directory. Its report traverses the file tree recursively.
Here is sample output using du on a subdirectory of one of the file
systems listed in the previous example:
% du /var/sam
4 /var/sam/preferences
10 /var/sam/log
2 /var/sam/lock
2 /var/sam/rt
142 /var/sam
The final number reported is the total number of 512-byte blocks
used under the /var/sam directory; the number is therefore twice as
large as the same usage expressed in the 1024-byte blocks that bdf reports.
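The sample du output can be reduced to the same units as bdf with a short calculation. The Python sketch below parses the sample lines, converts the total to 1024-byte blocks, and works out how many blocks are not accounted for by the listed subdirectories. The helper name is hypothetical.

```python
DU_OUTPUT = """\
4 /var/sam/preferences
10 /var/sam/log
2 /var/sam/lock
2 /var/sam/rt
142 /var/sam
"""

def parse_du(text):
    """Map each reported path to its size in 512-byte blocks."""
    sizes = {}
    for line in text.splitlines():
        blocks, path = line.split(None, 1)
        sizes[path] = int(blocks)
    return sizes

sizes = parse_du(DU_OUTPUT)
total = sizes["/var/sam"]
in_subdirs = sum(n for path, n in sizes.items() if path != "/var/sam")
print(total // 2)          # total expressed in 1024-byte blocks
print(total - in_subdirs)  # blocks not accounted for by the listed subdirectories
```

The second figure covers the directory itself and the files stored directly in it, because du reports each directory's total recursively.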
Note:
If it encounters a protected directory (that is, one whose file
permissions are set to prevent access), du cannot report the
number of blocks contained in that directory or its subdirectories.
Disk Geometry Database -- /etc/disktab
______________________________________
Note:
/etc/disktab is provided for backward compatibility only.
Do not rely on it for current information; newfs now determines the
geometry requirements of disks when it creates a file system.
The /etc/disktab file is a database of disk information. It provides
reference data about the many HP disks supported on a given computer
system and serves as tutorial information about disk geometry.
Because /etc/disktab is a database, its information appears in terse form,
as follows:
ty      Type of disk.
ns      Number of 1K sectors per track.
nt      Number of tracks per cylinder.
nc      Total number of cylinders per disk.
s0      Size of file system in 1K blocks.
b0      Block size in bytes. (Default block size for all systems is 8K.)
f0      Fragment size in bytes. (Default fragment size for all systems is 1K.)
se      Number of bytes per physical sector.
rm      Rotational speed of disk platters, in revolutions per minute.
Not all abbreviations are used on all systems.
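The geometry fields combine in a straightforward way: sectors per track times tracks per cylinder times cylinders gives the raw capacity, and because ns counts 1K sectors the result is in Kbytes. The Python sketch below uses an invented entry held in a dictionary; it does not read or reproduce the actual /etc/disktab syntax, and none of the values describe a real disk.

```python
def capacity_kbytes(entry):
    """Raw capacity implied by geometry: sectors/track x tracks/cyl x cylinders.

    ns counts 1K sectors, so the product is already in Kbytes.
    """
    return entry["ns"] * entry["nt"] * entry["nc"]

# Hypothetical geometry, not copied from any real /etc/disktab entry.
entry = {"ty": "winchester", "ns": 32, "nt": 16, "nc": 1000,
         "s0": 500000, "b0": 8192, "f0": 1024}

raw_kb = capacity_kbytes(entry)
print(raw_kb)                       # raw capacity in Kbytes
print(entry["s0"] <= raw_kb)        # the file system must fit on the disk
print(entry["b0"] // entry["f0"])   # fragments per block
```

With the default 8K block and 1K fragment sizes, each block divides into eight fragments.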
The contents of /etc/disktab are used if you construct a file system with
newfs -O. /etc/disktab provides entries that enable you to specify whether
you want portions of a disk used for swap and boot.
newfs no longer shows Series 800 disk sections.
If you are using the LVM, you have even less cause to refer to /etc/disktab,
although you might refer to it when you want to use non-default settings
for file-system specification (for example, to change the fragment size or
customize the various file-system sizes). Before adding a physical
volume (disk) to a volume group, you might consult /etc/disktab to get
an idea of the disk size. For full specifications, see disktab(4) in the
HP-UX Reference.
Boot Area
_________
The boot area is the portion of the disk that holds the code used to
bring the system into an operational state. The boot code initializes
and tests the hardware, then loads into memory a secondary loader. The
secondary loader is the program that loads /stand/vmunix (the operating
system) into memory to enable you to use your system. (For detailed
information about boot code, see the white paper entitled, "System
Startup.")
The boot area is reserved on the mass storage medium (usually a disk)
during the installation process. Information in the boot area is used
only if the disk is used for booting (boot disk), but the space can be
reserved on all disks.
Although the disk layouts for HP-UX platforms differ, all systems use a small
file system for the initial system booting, written in Logical Interchange
Format (LIF). (LIF is described in lif(4) in the HP-UX Reference.
The manual page also contains pointers to the LIF utilities.)
Using LVM, the boot data is contained in a Boot Data Reserved Area, which
is created using the pvcreate -B command.
If the system is administered without LVM, the boot area on a Series 700
precedes the file system on the disk. On Series 800 systems using traditional
disk sections, the boot area must reside in its own disk section distinct
from the file system and swap area sections.
Series 700 Boot Area Implementation
-----------------------------------
The Series 700 loader understands the layout of the file system. The
Series 700 boot area has pointers to the actual bootstrap programs.
The lifls command on a Series 700 reports the presence of FS, SWAP, ISL,
AUTO, HPUX, IOMAP, EST, and PAD files. Its report of FS and SWAP
indicates that Series 700 LIF has knowledge of the entire disk,
including the file system and swap. When the system is booted, the
loader can find /stand/vmunix at a default or designated location, using
the boot console user interface.
For more information, see the owner's guide for the Series 700 systems
or hpux(1M) in the HP-UX Reference.
Series 800 Boot Area Implementation
-----------------------------------
On Series 800, the LIF header contains ISL, HPUX, AUTO, RDB, and IOMAP
files. ISL uses the AUTO file to locate the HP-UX kernel.
Primary Swap Area
_________________
The primary swap area is a contiguous area of the root disk used by the
virtual memory system (see white paper on Memory Management) to
temporarily store a process image. The primary swap area is specified
in /etc/fstab. Until /sbin/rc1.d/S500swap_start executes swapon,
primary swap is your only swap device.
Device swap space is used for primary swap, because the system can access it
directly, without having to go through a file system.
On systems using LVM, the primary swap area resides in the root volume
group in a designated logical volume. You can set up multiple swap areas
in logical volumes that are on separate disks (physical volumes).
On Series 700 systems, the primary swap area occupies either the
blocks after a file system area or an entire disk dedicated as
a swap disk. If you have multiple disks, each one can contain its own
swap area, but there is still only one primary swap area on
the entire system.
On Series 800 systems using disk sections, primary swap space occupies its own
section, separate from the file system and boot area sections. A single disk
should not have multiple swap sections, because performance will degrade as
the system attempts to do interleaved writes to swap areas on separate areas
of the disk. Instead, configure multiple swap areas on separate disks.
(For discussion of interleaving, see the Memory Management white paper.)
You can list all swap areas on your system using the swapinfo(1M)
command; see the HP-UX Reference. Procedures for managing swap space
are found in the "HP-UX System Administration Tasks" manual.
File System Layout
==================
With the exception of disk drives used for raw data, every disk drive
contains one or more file systems. All HFS file systems are laid out in a common
format, with the following structures:
* Primary superblock
* Multiple cylinder groups
The many data structures governing the superblock and
cylinder group are defined in several header files, particularly
/usr/include/sys/fs.h. Superblock headers, defined in fs.h, also include
absolute disk addresses for the first boot block and
definitions of numerous file system attributes, including
cylinder-group characteristics (such as rotational positions, number
of inodes per group, number of fragments per block), file length,
and mirror states of root and primary swap.
For description of file-system format, see fs(4) of the HP-UX Reference.
The Superblock
______________
The superblock is a contiguous, 8-KB block of disk space near the
beginning of the file system's disk section. The superblock contains a
record of the static information about the state of the file system at
the time of its creation (or extension, if using LVM):
* file system size
* number of inodes it can store
* locations of free space on the file system
* number of cylinder groups
* location of superblocks, cylinder groups, inode blocks, and
data blocks
* size and number of blocks and fragments.
The primary superblock also keeps track of file system update information
in its summary information area. HP-UX uses information in the superblock
for various file system maintenance procedures -- for example, when you
mount a file system or perform a file system check by executing fsck.
Because the superblock is so important, HP-UX always keeps redundant
copies on disk. The primary superblock is at the beginning of the file
system, and each cylinder group holds a redundant copy. One copy is
brought into main memory when you boot up. This redundancy helps ensure
the integrity of file system data.
The primary (non-redundant) superblock on the disk is updated whenever
the sync command is executed and when a file system is unmounted (see
sync(1M) in the HP-UX Reference).
A record of all superblock locations can be found in /var/adm/sbtab.
The Cylinder Group
__________________
The cylinder group is the term used to describe a further
internal organization of disk layout.
Picture a set of disks stacked on top of one another, rotating around
the same single point. One movable arm for each disk in the set extends
from outside the edge of the disk toward the center of rotation. All
the arms are tethered together so that they move in unison. At the end
of each arm (toward the central point) is a read/write head that can
access any point on the disk surface.
A cylinder is a collection of tracks located the same distance from the
edge of each disk platter, accessible by the read/write heads without
moving the arms. Since all the tracks in a cylinder are under the
read/write heads of the disk drive simultaneously, the blocks on each
track can be accessed after only rotational latency; that is, with no
seek time.
For performance reasons, small groups of adjacent cylinders (sixteen by
default; see newfs_hfs(1M)) are grouped together as cylinder groups. Each
cylinder group has its own set of inodes and local mappings of free
space in the group. This internal organization both keeps file-system
inodes close to their associated data, avoiding long seeks, and
disperses data and inodes across the cylinders of the disk. Minimum
time is lost seeking file data within a cylinder group.
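The grouping of cylinders can be expressed as simple integer arithmetic. The Python sketch below assumes the default of sixteen cylinders per group mentioned above and a hypothetical disk of 1000 cylinders; it also assumes that a final, partly filled group is still counted as a group.

```python
import math

def cylinder_group_count(cylinders, cylinders_per_group=16):
    """Number of cylinder groups needed to cover the given cylinders."""
    return math.ceil(cylinders / cylinders_per_group)

def group_of(cylinder, cylinders_per_group=16):
    """Index of the cylinder group containing a zero-based cylinder number."""
    return cylinder // cylinders_per_group

print(cylinder_group_count(1000))               # groups on a 1000-cylinder disk
print(group_of(0), group_of(15), group_of(16))  # cylinders 0-15 share one group
```

Because each group carries its own inodes and free-space maps, a file whose inode and data fall in the same group can be read with arm movement confined to at most sixteen cylinders.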
The cylinder group controls all access to a file and its associated
data. Each cylinder group contains a copy of the superblock, a cylinder
group information structure, an inode table, and data blocks.
Cylinder Group Layout
---------------------
Data Structure                 Size
--------------                 ----
Boot block                     8 KB
Primary superblock             8 KB
Redundant superblock           8 KB
Cylinder group information     1 block (4 KB or 8 KB)
Inode table                    varies (see Inodes section)
Data blocks                    0 or more blocks (due to offset;
                               see Data Blocks section)
Only the first cylinder group is likely to have a boot block. The
beginning of all subsequent cylinder groups might be filled by data
blocks, depending on offset.
A redundant copy of the superblock is located in each cylinder group.
This ensures that if any single track, cylinder, or platter is damaged,
the file system itself can be repaired by executing fsck and specifying
an alternate superblock. Further, each successive cylinder group is
laid out offset by one track in relation to the previous cylinder group,
so that the redundant copies of the superblock spiral down the platters.
The cylinder group information contains the dynamic parameters of the
cylinder group:
* Number of inodes and data blocks in the cylinder group
* Pointers to the last used block, fragment, and inode
* Number of available fragments
* Used inode map
* Free block map.
The cylinder group information data structure's size is one block (a
block can be defined when running newfs as either 4 KB or 8 KB). The
layout of the cylinder group information is defined in /usr/include/sys/fs.h.
Inodes
------
Besides maintaining information about the file-system state, the cylinder
group holds key information about the file-system inodes -- the system's
index to the actual files of data. Inodes contain the locations of the
actual file data.
The cylinder group maintains an inode table, which provides summary
information about each file in the cylinder group (see the figure, "Mapping
from Inode to File Data Blocks," later in this paper). In addition, the
"disk inodes" appear in an expanded version ("in-core inodes") in memory for
inodes currently (or recently) used. A disk inode includes the following
information:
* mode and file type
* number of links to the file
* owner and group information
* file size in bytes
* time stamps
* pointers to the file's actual blocks of data on disk
When a file is read into memory, its in-core inode also shows the following:
* status of the in-core inode, including if the inode is locked, if a
process is waiting for the inode, if the disk inode now differs from
the in-core copy due to file modification, if the file is a mount point.
* numeric address of the file system containing the file.
* inode array number by which the kernel identifies the disk inode.
* pointers to other in-core inodes linked on buffer hash and free lists.
The /usr/include/sys/inode.h header file defines the in-core inode;
the /usr/include/sys/ino.h header file defines the disk inode.
When the operating system accesses a file, it finds the file using the inode
pointers to the file blocks of data. This is discussed later in this paper.
A static number of inodes is allocated for each cylinder group when the file
system is created. HFS uses a default that provides sufficient inodes per
cylinder group for average usage. If the file described by the inode is not
a regular file, some of the inode fields differ as follows:
* FIFO and pipes:
The space reserved for indirect block pointers contains information
about the current state of a FIFO or pipe.
* Character or block device files:
The first direct block address is actually the major and minor number
of the device. The rest of the direct block addresses are 0.
* Directories:
The pointers point to regular file system data blocks that contain
specially formatted data described in /usr/include/sys/dir.h.
When you create a file system (using newfs or mkfs), the system creates
inodes. The number of created inodes limits the number of files that you can
have in a file system. Each time you create a file, an inode is allocated
for that file. Both commands default to 6144 bytes per inode, meaning the
system assumes that the average size of your files will be 6144 bytes.
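As a rough illustration of the bytes-per-inode default, the following Python
sketch estimates how many inodes a file system of a given size would receive.
The function name and the plain division are assumptions made for
illustration only; newfs and mkfs allocate inodes per cylinder group and
round the result, so actual counts will differ somewhat.

```python
def estimate_inode_count(fs_size_bytes, bytes_per_inode=6144):
    """Approximate inode count: one inode per bytes_per_inode of space.

    Hypothetical helper; it ignores per-cylinder-group rounding.
    """
    if bytes_per_inode <= 0:
        raise ValueError("bytes_per_inode must be positive")
    return fs_size_bytes // bytes_per_inode


if __name__ == "__main__":
    # A hypothetical 1-GB file system created with the default density.
    print(estimate_inode_count(1024 ** 3))
```

A file system that will hold many small files would be created with a
smaller bytes-per-inode value, which yields more inodes for the same space.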
Although uncommon, the error message "inode: table is full" might require
you to change the size of the inode table. This message refers to the
kernel's in-core inode table. A configurable parameter, ninode, defines
the maximum number of open, in-core inodes. You can use SAM to change
this configurable parameter.
Data Blocks
-----------
Disk space before or after the superblock, cylinder group information, and
inode table is filled with data blocks. (The specific locations of data on
each platter are different, due to the cylinder-block offset.) The blocks are
used to store data for regular files, directories, and symbolic links.
HP-UX provides support for file systems in several block sizes:
4 KB, 8 KB, 16 KB, 32 KB, or 64 KB.
Block size is set using the mkfs or newfs command, when you construct
a file system. See mkfs(1M) and newfs(1M) in the HP-UX Reference.
Larger block sizes are faster for sequential access to the file system,
while smaller block sizes use space more efficiently and are better for
random I/O. Having a large block size has both benefits and costs. For
big files, a large block size significantly reduces the number of disk
accesses, thereby increasing file system throughput. The problem is
that most HP-UX files are small; thus, using a large block size for
small files might waste space.
In the fs.h header file, the block size is held in the fs_bsize field,
whose value depends on the block size your file system uses.
Fragment size is specified at file system creation. To minimize wasted space,
fragments can be one-eighth, one-fourth, one-half or the same size as a block.
A block can be divided into 1 KB, 2 KB, 4 KB, or 8 KB fragments.
How a File is Accessed from Inode to Data Blocks
================================================
Inodes in the cylinder group contain pointers to the locations of a
file's actual data. Depending on the size of a file, its data might be
reached through direct pointers to data blocks or through indirect
blocks, which are blocks containing more pointers to the data. HP-UX
allows for up to triple indirect pointing for enormous files.
The next figure shows the mapping from an inode to a file's data blocks.
The first 12 pointers in an inode point directly to the first 12 blocks
or fragments containing the file's data. If the file is larger than
12 blocks (greater than 12 times fs_bsize), indirect reference is made to
the file's data. A group of 4-byte indirect pointers is contained in
one data block; there can be either 1024 pointers (4096/4) or 2048 pointers
(8192/4) in each block of indirect pointers.
The thirteenth block address in the inode points to a block containing
1024 or 2048 additional pointers to data blocks. The number of indirect
pointers in a block is called num_ip. Thus, the thirteenth (single indirect)
block address handles files up to 4,243,456 bytes in a 4-KB block file system
or 16,875,520 bytes in an 8-KB block file system (fs_bsize times (12+num_ip)).
If the file is larger, the fourteenth (double-indirect) inode block address
points to a block of num_ip pointers to indirect blocks, each of which
contains pointers to an additional num_ip actual data blocks. If the file
cannot be contained in this space, the fifteenth (triple-indirect) inode
block address points to a block of num_ip pointers to double-indirect blocks.
With the fifteenth block address, the size of a file is limited to
fs_bsize times (12 + num_ip + num_ip squared + num_ip cubed).
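The arithmetic above can be checked with a short Python sketch. The function
and constant names are hypothetical; only the 4-byte pointer size, the 12
direct pointers, and the formula come from the text. The result is the
addressing limit only; other limits in the system might apply first.

```python
POINTER_SIZE = 4       # bytes per indirect pointer, as stated in the text
DIRECT_POINTERS = 12   # direct block addresses held in the inode


def num_ip(fs_bsize):
    """Number of indirect pointers that fit in one block."""
    return fs_bsize // POINTER_SIZE


def addressable_bytes(fs_bsize, levels):
    """Bytes reachable with the direct pointers plus `levels` (0 to 3)
    levels of indirection."""
    if levels not in (0, 1, 2, 3):
        raise ValueError("levels must be 0, 1, 2, or 3")
    n = num_ip(fs_bsize)
    blocks = DIRECT_POINTERS + sum(n ** k for k in range(1, levels + 1))
    return fs_bsize * blocks


if __name__ == "__main__":
    for bsize in (4096, 8192):
        for levels in range(4):
            print(bsize, levels, addressable_bytes(bsize, levels))
```

With levels=1 the function reproduces the single-indirect figures quoted
above for 4-KB and 8-KB block file systems.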
Mapping from Inode to File Data Blocks
--------------------------------------
inode 1st level 2nd level file contents here
+-------------------+ indirection indirection
| mode & file type | +---+ +---+ +---+
+-------------------+ | | | | ... | |
|# links to file | +---+ +---+ +---+
+-------------------+ ^ ^ ^
|owner, group info | | | |
+-------------------+ | | |
| file size in bytes| | | |
+-------------------+ | | |
| time stamps | | | |
+-------------------+ | | |
| direct 1 |------------------------------------+ | |
| blocks 2 |------------------------------------------+ |
| ... | |
| 12 |----------------------------------------------------+
+-------------------+ +---+ +---+ +---+
|single indirect |-------+ | | | | ... | |
+-------------------+ | +---+ +---+ +---+
|double indirect |-+ v ^ ^ ^
+-------------------+ | +-----+ | | |
|triple indirect | | | 1 |-------------------------+ | |
| | | | 2 |-------------------------------+ |
+-------------------+ | | ... | |
| |1K or| |
| | 2K* |-----------------------------------------+
| +-----+ +---+ +---+ +---+
| | | | | ... | |
| +---+ +---+ +---+
| +-----+ +-----+ ^ ^ ^
+->| 1 |----->| 1 |------------+ | |
| 2 | | 2 |------------------+ |
| ... | | ... | |
|1K or| |1K or| |
| 2K* |-+ | 2K* |----------------------------+
+-----+ | +-----+ +---+ +---+ +---+
| | | | | ... | |
| +---+ +---+ +---+
| +-----+ ^ ^ ^
+-> | 1 |------------+ | |
| 2 |------------------+ |
| ... | |
|1K or| |
| 2K* |----------------------------+
+-----+
* 1K pointers if file-system block size = 4KB
2K pointers if file-system block size = 8KB
Inode pointers hold the address of a fragment. The address references
an entire block or one or more fragments, depending on the number of
bytes stored at the address. All blocks but the last have a full block
of data allocated to them. If the amount of data in the last block is
less than the file system block size, only the number of consecutive
fragments needed to store the data is allocated. For
example, in an 8-KB/1-KB file system, a 19-KB file is stored as two 8-KB
blocks and three consecutive 1-KB fragments. (The latter might also be
referred to as a 3-KB fragment.) This allocation scheme provides the
performance advantage of large blocks with the space savings of small
fragments.
The next figure shows an example of a 20-KB file stored in 8-KB blocks
with 1-KB fragments. The number of blocks needed is 20/8 (file
size/block size): 2 full blocks with a remainder of 4 fragments.
Therefore, the first and second pointers point to full blocks, but the
third pointer points to the remaining 4 fragments.
Sample Inode Addressing
-----------------------
              Inode
            +-------+
            |  ...  |
            +-------+
        file|  20K  |     +------+      +------+      +-------+
        size|       |     +------+      +------+      +--|----+
            +-------+     8     15      24    31      40 43  48
            |  ...  |     ^             ^                ^
            +-------+     |             |                |
           1|   8   |-----+             |                |
    direct  +-------+                   |                |
    blocks 2|  24   |-------------------+                |
            +-------+                                    |
           3|  43   |------------------------------------+
            +-------+
           4|   0   |
            +-------+
         ...|       |
            +-------+
          12|   0   |
            +-------+
         ...|       |
            +-------+
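The split into full blocks and trailing fragments used in the 20-KB example
can be sketched in Python as follows. The function name is hypothetical, and
the sketch covers only files small enough to be addressed by the direct
pointers.

```python
def split_file(file_size, block_size=8192, frag_size=1024):
    """Return (full_blocks, tail_fragments) for a file of file_size bytes.

    The tail is rounded up to whole fragments, since a fragment is the
    smallest unit that can be allocated.
    """
    if block_size % frag_size != 0:
        raise ValueError("block_size must be a multiple of frag_size")
    full_blocks, tail_bytes = divmod(file_size, block_size)
    tail_fragments = -(-tail_bytes // frag_size)  # ceiling division
    return full_blocks, tail_fragments


if __name__ == "__main__":
    # The 20-KB file from the example: 2 full blocks and 4 fragments.
    print(split_file(20 * 1024))
```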
All indirect blocks are referenced only as full blocks; no pieces of the
file are addressed at the fragment level beyond the 12 direct pointers.
When a block or fragment is needed, the disk is searched for free blocks.
Ideally, free blocks are distributed throughout the disk, so that a search
can locate a free block close to related blocks. When the file system is
nearly full, long linear searches are needed to find a free block, and the
block that is allocated is likely to be far from the previous block of the
file, resulting in long seeks and slow performance.
Minimum Free Space
__________________
To ensure the availability of free blocks near one another, a certain
percentage of free space must always be available in the file system.
This minimum free space percentage is specified at file system creation
using the -m option of the newfs command or the minfree argument of the
mkfs command. The default is 10 percent. Values lower than 10 percent
may severely degrade system performance, by causing the file system to
search harder for free space.
The percent of free space can be changed at any time using tunefs -m.
The reserved free space is inaccessible to the normal user; once this
threshold is met, only the superuser can continue to allocate blocks.
When the percentage of free space drops below the threshold, system
throughput (to and from newly created files) drops because the file system
can no longer localize the blocks for a file. Accessing a file is quicker
if the entire file is grouped together.
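The effect of the minimum free space threshold can be pictured with a small
Python sketch. The function name and the fragment-based accounting are
assumptions made for illustration; the kernel's own bookkeeping is not
described in this paper.

```python
def user_may_allocate(free_frags, total_frags, minfree_pct=10):
    """Return True if an ordinary (non-superuser) process may still
    allocate space, that is, if free space is above the reserve."""
    reserved = total_frags * minfree_pct // 100
    return free_frags > reserved


if __name__ == "__main__":
    # Hypothetical file system with 1000 fragments and the default 10%.
    print(user_may_allocate(150, 1000))   # above the reserve
    print(user_may_allocate(100, 1000))   # threshold reached
```

Lowering the percentage with tunefs -m makes more space available to users,
at the cost of the localization described above.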
How Disk Space is Allocated
===========================
Free space availability is determined from a bit map associated
with each cylinder group. The bit map contains one bit for each fragment.
To determine if a block is available, the system examines consecutive
fragments. A piece of the bit map from a file system using 1024-byte
fragments and 8192-byte blocks is shown next.
Sample Free Block Bitmap in an 8KB/1KB File System
--------------------------------------------------
bit map            00000000   00000011   11111100   11111111
Fragment numbers   0-7        8-15       16-23      24-31
Block numbers      0          1          2          3
Fragment numbers 14-21 and 24-31 in this example are free, indicated by ones
in the bit map. Fragment numbers 0-13 and 22-23 are allocated, as indicated
by zeroes in the bit map. Fragments in adjacent blocks cannot be used to
create a full block; only eight contiguous fragments starting on a block
boundary can be used to allocate a full block. Fragments 24-31 can be
coalesced to form a full block, but not fragments 14-21. Also, if a
partial block is allocated, the fragments must be consecutive and not
cross a block boundary. For example, if three fragments are needed,
fragments 16-18 can be allocated, but not fragments 14-16.
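The rules in this example can be expressed as a short Python sketch that
works on the sample bit map as a string of ones (free) and zeroes
(allocated). The function names are hypothetical, and the string form is
only a convenience; the kernel keeps a real bit map.

```python
FRAGS_PER_BLOCK = 8  # 8192-byte blocks divided into 1024-byte fragments

# The sample bit map from the text, one character per fragment.
BITMAP = "00000000" "00000011" "11111100" "11111111"


def free_full_blocks(bitmap):
    """Return the block numbers whose eight fragments are all free."""
    blocks = []
    for block in range(len(bitmap) // FRAGS_PER_BLOCK):
        start = block * FRAGS_PER_BLOCK
        if bitmap[start:start + FRAGS_PER_BLOCK] == "1" * FRAGS_PER_BLOCK:
            blocks.append(block)
    return blocks


def can_allocate_run(bitmap, first, count):
    """Return True if `count` fragments starting at `first` are all free
    and lie within a single block."""
    last = first + count - 1
    if count < 1 or first < 0 or last >= len(bitmap):
        return False
    same_block = first // FRAGS_PER_BLOCK == last // FRAGS_PER_BLOCK
    return same_block and all(bit == "1" for bit in bitmap[first:last + 1])


if __name__ == "__main__":
    print(free_full_blocks(BITMAP))
    print(can_allocate_run(BITMAP, 16, 3), can_allocate_run(BITMAP, 14, 3))
```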
Every time data is written to an existing file, the system checks to see
if file size must increase. If so, one of three conditions exists:
* Sufficient space exists in the existing block or fragment;
the new data is written into the already allocated space.
* The file contains only whole blocks; the last block contains
insufficient space to hold additional data. If more than a full
block of data must be written, a new block is allocated and written.
This process is repeated until less than a full block of new data is
needed. At that point, a block containing enough contiguous fragments
is located and the new data is written there.
* The file contains fragments, but not enough to hold the new data. If
the size of the existing data in fragments plus the new data exceeds
the size of a full block, a new block is allocated. Both the old and
new data are written to the new block. If the size of the old and new
data is less than a full block, a block with enough contiguous
fragments (or a full block) is located and allocated.
When a block or fragment has been located, the address is recorded
in the inode table and the free block bit map is updated.
Allocation Policies
___________________
Allocation is performed globally to place
new directories and files and locally to place data in blocks.
A global decision determines which cylinder group contains a given file or
directory. HP-UX attempts to put all files from a single directory in the
same cylinder group. Newly created directories are put in the cylinder group
with the greatest number of free inodes and smallest number of directories.
Once the file size reaches maxbpg (maxbpg is defined via the tunefs command),
HP-UX allocates blocks from another cylinder group. This helps to enforce
grouping of all files within one directory into a single cylinder group by
spreading the less common larger files over several cylinder groups.
Global allocation routines call local allocation routines with requests
for specific data blocks. Blocks are allocated by the following priorities:
* Allocate block requested.
* Allocate a block on the same cylinder that is rotationally
closest to the requested block.
* Allocate any block within the same cylinder group.
* Use a quadratic hash to find a new cylinder group; allocate a
block somewhere in the new cylinder group.
* Use sequential search to find an available block.
Speed in allocating blocks is the most important characteristic of this
strategy. For this reason, the percentage of free space must be maintained.
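The last two fallback steps, the quadratic hash and the sequential search,
can be pictured with the following Python sketch. It is modeled on the
rehash used in the Berkeley fast file system, in which the step between
probes doubles each time; the exact probe sequence in the HP-UX kernel is
not given in this paper, so the order shown here is an assumption. The
function name is hypothetical.

```python
def probe_order(preferred_cg, ncg):
    """Return the order in which cylinder groups would be tried:
    the preferred group, then a rehash with doubling steps, then a
    sequential sweep of whatever groups remain."""
    order = []

    def visit(cg):
        if cg not in order:
            order.append(cg)

    visit(preferred_cg)                      # preferred cylinder group
    cg, step = preferred_cg, 1
    while step < ncg:                        # rehash with doubling steps
        cg = (cg + step) % ncg
        visit(cg)
        step *= 2
    for offset in range(2, ncg):             # sequential search
        visit((preferred_cg + offset) % ncg)
    return order


if __name__ == "__main__":
    print(probe_order(0, 8))
```

The rehash spreads the first few probes widely, so that a full neighborhood
of cylinder groups is skipped quickly; the sequential sweep then guarantees
that every group is eventually examined.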
The File-System Buffer Cache
============================
The file-system buffer cache manages data flow between main memory and
secondary memory (principally disks), by temporarily holding (buffering)
information about data being transferred to and from disk. The buffer
cache speeds data transfer from the file system to main memory; once
buffered, data is accessed by a process's executing space in main
memory much faster than from the file system on disk. The buffer
cache is used for all file system I/O operations, plus all other block
I/O operations in the system (for example, mount, inode reading, LVM
management, and some device drivers).
The role of the buffer cache is illustrated below. When you execute a
program, the shell passes the file path name to exec, which finds the file
on disk and reads the a.out header into the buffer cache. The a.out header
contains preliminary information about the executable, including the sizes
of the text and uninitialized data (bss) areas.
Buffer Cache Holds the a.out Header of Executing Programs
---------------------------------------------------------
  Secondary Storage                              Main Memory
+--------------------+                   +-------------------------+
|                    |                   | buffer cache            |
|  program file ++   |------------------>| containing a.out header |
|               ++   |                   |                         |
|                    |<----------------->| program executable      |
+--------------------+                   +-------------------------+
As the code executes, the virtual-memory system reads the pages of
data directly from the disk into memory. (Some additional pages might
also be read in, based on the probability they will be needed.) The file's
a.out header, which is needed only to begin the demand paging, might (or
might not) remain in the buffer cache throughout the process execution,
depending on whether its buffer is needed.
If you have just created and compiled a program, all transactions occur
from the buffer cache. For an existing program, however, data might exist on
both disk and buffer cache. When a page is faulted in from disk to memory,
HP-UX also ensures that the process executes using the most current copy of
the data. During a file-system write, HP-UX ensures that only the most current
copy of the data, whether in the virtual-memory system's page cache or in the
buffer cache, is written to disk.
Structure of the Buffer Cache
-----------------------------
The buffer cache consists of two parts:
* buffer headers, which have pointers to the buffer and describe
its contents.
* buffer data areas, which reside in data blocks ranging in size
from DEV_BSIZE to MAXBSIZE.
Like a file system, a buffer must always be some multiple of DEV_BSIZE.
MAXBSIZE is the largest buffer size, in bytes. The smallest unit of memory
assigned to a buffer is one page.
The data structures used for allocating and managing buffers are defined
in the /usr/include/sys/buf.h header file.
Requests for buffers come from many sources, including file-system reads and
writes, and device driver allocations. If a buffer is requested and not
already in the cache, the operating system obtains the buffer header,
allocates memory for the pages of the buffer, and then gives it to the part
of the operating system making the request.
Implementation of the Buffer Cache
----------------------------------
The HP-UX file-system buffer cache can be implemented in two ways:
* Dynamically.
The dynamic buffer-cache implementation allows the buffer cache to
change in size depending on system demand for virtual memory vs.
buffer cache. As of HP-UX release 10.0, the buffer cache is
implemented dynamically, by default. Instead of setting fixed values
using the familiar nbuf and bufpages parameters (both nbuf and bufpages
are now set to zero), the operating system uses two new parameters,
set as a percentage of physical memory. By default, dbc_min_pct is set
to 5% of physical memory; dbc_max_pct is set to 50% of physical memory.
These percentages can be changed to as low as 2% or as high as 90%,
respectively.
* Fixed.
The number of buffers in the cache is set by two operating-system
parameters in the /stand/system file -- nbuf and bufpages. When you
power up your system, these parameters reserve memory for buffer
headers (nbuf) and for pages of memory for buffer-cache use (bufpages)
based upon the amount of available RAM. Of the two parameters,
bufpages is more critical, defining the amount of memory in buffer
cache, which can vary depending on block size. If either nbuf or
bufpages is set to a value other than zero, a fixed buffer cache
is implemented.
You can use SAM to change the buffer-cache operating system parameters
(dbc_min_pct, dbc_max_pct, nbuf, and bufpages) and then reboot to implement
the changes. Since the values are stored in /stand/system, you can edit
the file to assign the values, but the SAM method is recommended. For further
information, refer to the SAM online help and the "HP-UX System
Administration Tasks" manual.
Implementation of a Dynamic Buffer Cache
----------------------------------------
From a system-administration perspective, using the dynamic buffer cache is
simple: the operating system is shipped with it set up by default.
The size of the buffer cache is determined by two parameters, dbc_min_pct
(5% by default) and dbc_max_pct (50% by default), which are set in SAM. The
dynamic buffer cache begins at the dbc_min_pct value and can grow to the
dbc_max_pct value as I/O requests occur. When memory pressure occurs, the
cache can shrink to a minimum of dbc_min_pct.
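To make the percentages concrete, the following Python sketch converts
dbc_min_pct and dbc_max_pct into page counts for a given amount of physical
memory. The function name and the 4-KB page size are assumptions made for
illustration; the kernel's own rounding is not described in this paper.

```python
PAGE_SIZE = 4096  # assumed page size, in bytes


def dbc_bounds_pages(physical_memory_bytes, dbc_min_pct=5, dbc_max_pct=50):
    """Return (minimum, maximum) buffer-cache size in pages."""
    if not 0 < dbc_min_pct <= dbc_max_pct <= 100:
        raise ValueError("need 0 < dbc_min_pct <= dbc_max_pct <= 100")
    total_pages = physical_memory_bytes // PAGE_SIZE
    return (total_pages * dbc_min_pct // 100,
            total_pages * dbc_max_pct // 100)


if __name__ == "__main__":
    # A hypothetical system with 64 MB of physical memory.
    print(dbc_bounds_pages(64 * 1024 * 1024))
```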
Although the nbuf and bufpages operating-system parameters are not specified
in /stand/system, the operating system determines how many buffer-cache pages
are needed for optimal system performance. You can choose to configure these
parameters, but if you do, the buffer cache will not function dynamically.
With dynamic buffer cache, nbuf is set to one-half bufpages (that is,
half the minimum percent; by default, 2 1/2%). The number of buffer
headers (nbuf) does not change.
The dynamic buffer cache is implemented to grow and shrink in size, depending
on operating-system and virtual-memory need. Demand for memory is generated
not only by the file system, but also by other objects, including processes,
data regions, and memory-mapped files.
Both buffer cache and virtual-memory subsystem access the same body of RAM
in main memory. The dynamic buffer cache is allowed to grow considerably
larger than a fixed buffer cache, permitting more data to be held in memory.
When the virtual-memory system requires more memory, the dynamic buffer cache
is reduced to yield memory for processes.
The dynamic buffer cache functions like a large memory-mapped file shared
among all the processes running on the system. (Note that there are a number
of subtle interactions between the buffer cache and memory-mapped files that
can streamline bringing data into the virtual-memory subsystem.)
The dynamic buffer cache uses an algorithm based on two free lists,
LRU (least recently used) and EMPTY (unallocated buffer headers), for
reusing existing buffer pages and allocating more pages from memory.
The LRU list holds buffers in most-recently to least-recently used order.
This list may grow as long as the buffer cache is growing. When a
block is read for the first time, its buffer is inserted mid-list, at
fairly high priority. If accessed again, its priority is increased.
Other buffers might decrease in priority (such as file-system writes to
an entire block, which typically do not get referenced again).
The dynamic buffer cache shrinks by use of vhand, the virtual-memory
subsystem's pageout daemon. vhand reclaims pages of memory from the buffer
cache as well as virtual memory, by using reference bits, much as it does
for the virtual-memory subsystem's regions. Its first (age) hand clears
the status bits of any buffer pages not recently accessed. If the status
bit remains clear by the time the second (steal) hand traverses it, vhand
reclaims (pages out) the associated page.
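The two-handed scheme can be pictured with a much-simplified Python sketch.
It models one pass only: the age hand clears each page's bit, some pages are
touched again before the steal hand arrives, and the steal hand reclaims
whatever is still clear. The function name and data layout are hypothetical,
and real vhand behavior (hand spacing, scan rate, dirty pages) is not
modeled.

```python
def two_handed_pass(page_ids, touched_between_hands):
    """Return the pages the steal hand would reclaim in one pass.

    page_ids: pages in the order the hands sweep them.
    touched_between_hands: pages accessed after the age hand passed
    them and before the steal hand reached them.
    """
    referenced = {}
    for page in page_ids:              # age hand: clear every bit
        referenced[page] = False
    for page in touched_between_hands: # an access sets the bit again
        if page in referenced:
            referenced[page] = True
    # steal hand: reclaim pages whose bit is still clear
    return [page for page in page_ids if not referenced[page]]


if __name__ == "__main__":
    print(two_handed_pass(["a", "b", "c", "d"], {"b", "d"}))
```

Pages that are in active use are touched between the two hands and survive;
idle pages are returned to the system.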
The dynamic buffer cache gives the operating system flexibility to accommodate
both small application programs that do a lot of I/O and large programs that
do little I/O but require many pages of memory for data.
For information on memory-mapped files, and the vhand and swapper
daemons, see the Memory Management white paper.
Implementation of a Fixed Buffer Cache
--------------------------------------
Buffer headers are allocated in a single contiguous block and treated as
an array. Inactive buffer headers are placed on one of three doubly-linked
lists -- LRU (least recently used), AGE, and EMPTY:
* Although its name suggests otherwise, the LRU list actually points to
blocks of most frequently accessed data, representing no more than 40%
of total buffers. If data in a buffer is dirty (that is, its contents
changed since accessed from the file system), its pages must be written
to disk before the pages can be reallocated.
* The AGE list contains buffers accessed less frequently and the
overflow of the LRU list.
* The EMPTY list contains unallocated buffer headers.
If a buffer requires more than one page, HP-UX ensures that the pages are
assigned consecutive addresses.
As code and data move from the file system into the buffer cache, the system
copies the information from the buffer cache into user's main memory. If a
user requests information already in the buffer cache, the information is
copied from the cache to user's main memory, eliminating the I/O operation to
bring it in from disk.
When data is written through the buffer cache, any data in the virtual-memory
system's page cache (in main memory) with the same vnode and block address is
purged. Virtual addresses used by the buffer cache are in kernel space.
When a pagein occurs, both the buffer header (on one of the buffer lists) and
associated data in the buffer cache are flushed.
How the HFS File System Modifies Files
======================================
Every time a file is modified, the HP-UX operating system updates the
file system to ensure its consistency.
When a process updates (writes to) the file system, the data being
written is copied into an in-memory buffer cache. The physical disk is
updated asynchronously from the buffer write. The data and inode information
reflecting the change is written to the disk later, unless the file was
opened in the synchronous mode (see the section on Synchronized I/O Flags in
the open(2) manpage of the HP-UX Reference). The process continues, though
the data has not yet been written to the disk. If the system is halted
without writing the buffer to disk, the file system on the disk is left in an
inconsistent state. Such inconsistencies are flagged and corrected, if
possible, by the fsck command at system startup.
The sync command can be used to force synchronization. The syncer command
routinely updates the file system's superblock, inodes, data blocks, and
cylinder group information, as described below. (See fsck(1M), sync(1M),
and syncer(1M) in the HP-UX Reference.)
Primary Superblock:
The superblock of a mounted file system is written to the disk whenever
a umount command is issued, or when a sync command is issued and the
file system has been modified.
Inodes:
An inode contains information describing the file. The inode is written
to disk after every modification, unless the fs_async parameter is set
in the /stand/system file. (See "fs_async on an HFS File System,"
later in this paper.)
Data blocks:
In-core blocks (including directories, indirect blocks, files, pipes,
symbolic links, and FIFOs) are written to the file system after being
modified and released by the operating system. Upon release, data blocks are
buffered or queued for eventual writing. Physical I/O takes place when the
buffer is needed by HP-UX, when a sync or fsync command is issued, or when
O_SYNC is set for the file. If a file is opened with the O_SYNC or O_SYNCIO
flag set, the write system call does not return until completed.
Cylinder group information:
The cylinder group information is updated whenever a sync is executed,
or when the system needs a buffer and the cylinder group is written.
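From an application's point of view, the difference between buffered and
synchronous writes can be sketched in Python using the standard os module.
The two function names are hypothetical. The first opens the file with
O_SYNC so that each write returns only after the I/O completes; the second
writes through the buffer cache and then forces the data out with fsync.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write data to a file opened with O_SYNC; return bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)


def write_then_fsync(path, data):
    """Write data through the buffer cache, then flush it with fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = os.write(fd, data)
        os.fsync(fd)  # force buffered data for this file to disk
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.dat")
        print(write_synchronously(target, b"synchronous"))
        print(write_then_fsync(target, b"buffered, then flushed"))
```

The O_SYNC form trades throughput for the assurance that each write has been
handed to the device before the call returns; as discussed under Immediate
Reporting, the device itself might still hold the data in its own cache.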
CAUTION:
* Always unmount a file system BEFORE executing fsck.
* Always reboot the system WITHOUT sync'ing (that is, use
reboot -n) after altering the root device with fsck.
A file system can become inconsistent if you execute fsck
on a mounted file system other than the root file system;
you risk missing buffered information not yet written to the
file system. If this information is then flushed from the buffer
cache, it might overwrite corrections that fsck had made.
Immediate Reporting
___________________
Numerous SCSI disk devices are shipped with a feature called immediate
reporting. Workstation disk devices are set with default ON; multi-user
disks are set with default OFF. Immediate reporting speeds status
notification; its implementation is handled by the disk controller and
disk device. However, immediate reporting also has some associated risks.
With immediate reporting, when a device driver sends a write request to
a device, the device accepts the data, places it in its buffer or its
cache, and reports to the SPU that the write completed successfully.
Without immediate reporting, status is not returned until the data goes
to the media itself.
In a power (or other) failure, data might not have been written
successfully to disk but might, in fact, still reside in a buffer. An
application writing to the raw device or to the file system using
O_SYNC continues processing as though the data has been written. If
data remains in the buffer at the time of a system failure, the database
is left in an inconsistent state.
Note, however, that O_SYNC might cause the driver to attempt to have I/Os
sourced through that open (marked B_SYNC) written through the
cache to media by use of a SCSI command, Write FUA. Not all devices
support this command, however.
Under rare circumstances, immediate reporting might also cause delayed
errors or system panics. This can occur in the following scenario: A
user has a write request and the system returns good status immediately.
If the next request is a kernel request and an error occurs (such as a
write failure) caused by the user's write request, the error might get
associated with the kernel request. If the kernel request cannot tolerate
the error, the kernel might panic. In other words, the I/O which has
already been reported successful actually fails. This failure is reported
on a subsequent I/O by a "deferred" error. Such erroneous I/Os cannot be
retried or reported to either the application or the kernel, since the only
information available to the driver is the report itself. The original I/O
(prematurely reported successful) is long gone, as might be the application.
Thus, the system's sole recourse may be to panic.
Immediate reporting can be set or disabled using scsictl(1M). If it is
critical that your system not go down (or cause silent data corruption),
you might want to disable immediate reporting. Although SCSI disks available
for Series 800 systems can be set for immediate reporting, the feature poses
greater risk of inconsistent data; the disks are shipped with the feature
disabled.
fs_async on an HFS File System
______________________________
When HP-UX writes data to disk synchronously, any file-system activity
must complete to the disk before the program is allowed to continue;
the process does not regain control until completion of the physical
I/O (regardless of whether the I/O is user data or operating-system
data). Synchronous writes include some file-system structures and
whatever an application writes with O_SYNC set.
When HP-UX writes to disk asynchronously, I/O is scheduled at some later
time and the process regains control immediately, without waiting for
the write to complete. (In the case of a SCSI disk, the data is
actually written to a write cache in the card controller, which is as
far toward the disk as the operating system can tell.)
By default, some critical changes to the structure of the file system
are posted to disk synchronously. Synchronous writes ensure file system
integrity in case of system crash, but this kind of disk writing also
impedes system performance. Run-time performance increases significantly
on I/O-intensive applications when all disk writes occur asynchronously;
little effect is seen for compute-bound processes. However, if a
system using asynchronous disk writes crashes, recovery might require
system-administrator intervention using fsck and might also cause user
data or directories to disappear.
As a system administrator, you can specify whether some disk writes are
performed synchronously or asynchronously. The fs_async parameter specified
in the /stand/system file enables and disables the feature. (You cannot
modify whether or not other types of disk writes occur synchronously.
They are asynchronous by default and synchronous if synchronous I/O flags
are set by the application.)
On both Series 700 and 800, the fs_async value is set to 0 by default.
This specifies that these writes should be performed synchronously.
Setting fs_async to 1 causes more writes to be performed asynchronously.
Typically, this causes file-system performance to improve.
Note too that fs_async deals with inodes and directories, while O_SYNC deals
with files and data. If a file is opened with O_SYNC, the file continues to
be written synchronously, regardless of the fs_async setting. O_SYNC
also causes inodes to be updated synchronously. For further information on
synchronous I/O, refer to open(2) in the HP-UX Reference.
Although asynchronous disk writes increase system performance for
most applications, if a system crashes, file-system data structures
are likely to be left in an inconsistent state. For this reason,
we do NOT recommend that you turn on fs_async on a production system.
Normally, file-system recovery is performed automatically by
fsck in the reboot process and does not require any intervention by
the system administrator. However, using asynchronous disk writes
might require system administrator intervention in the event of a crash.
For further information, refer to fsck(1M) in the "HP-UX Reference."
Minimizing File-System Corruption
=================================
Although the HFS file system is very reliable, hardware failures, accidental
power loss, or improper shutdown procedures can cause its corruption.
Problems, such as a bad block on a disk, power loss, or a non-functional disk
controller, can occur and cause the hardware to fail. By following
recommended hardware preventive maintenance procedures and by keeping regular
backups (as defined in the "HP-UX System Administration Tasks" manual), you
can avoid most serious problems and be prepared for any that might occur.
As a system administrator, you are responsible for preserving users' data.
Since the file system is the HP-UX data structure that stores the data, it
is essential that you safeguard the file system by performing maintenance
tasks (such as regular backups), following proper startup and shutdown
procedures, and by checking the file system when necessary using the fsck
command.
System Shutdown and Startup Guidelines
______________________________________
To ensure file system integrity, always follow proper shutdown and
startup procedures (described in the "HP-UX System Administration
Tasks" manual):
* Always shut down the system using reboot or shutdown.
* Never physically write-protect a mounted file system, unless it
is mounted read-only.
* Never take a mounted file system off-line (for example, by
shutting its power off or by disconnecting it) while it is in use.
Follow proper startup procedures:
* Always check the file system for inconsistencies.
(The fsck command runs automatically when the system reboots.)
* Always repair inconsistencies, using fsck. Allowing a corrupted
file system to be further modified in such circumstances can be
disastrous.
The /lost+found Directory
_________________________
Every file system should have a lost+found directory at its root. fsck, the
file system check command (discussed in the next section), places any problem
files or directories in lost+found. After fsck completes, you should
examine each file in lost+found to determine its name and location and
attempt to return it to its rightful place.
lost+found is created by both mkfs and newfs when they create file systems.
However, if your system lacks lost+found, you can rebuild it using
mklost+found(1M). mklost+found creates several empty file slots for fsck.
Understanding Use of fsck to Detect and Correct File-System Corruption
======================================================================
The fsck command is the principal file-system maintenance
tool for checking system consistency and making repairs.
NEVER run fsck on a mounted file system. However, fsck should be run
regularly to ensure the file system's structural integrity:
* fsck is invoked during system boot-up by the /etc/bcheckrc
script run by init.
* For preventative maintenance, fsck should be run weekly
(before each full backup) on all file systems, but particularly
on file systems that have been unmounted.
* You should run fsck any time you suspect problems with the
HP-UX file system. Be sure to unmount the file system first!
In performing its checks, fsck examines the file system several times,
each time examining different characteristics, including:
* Block and file size
* Path names
* Connectivity (parent-child relationships)
* Reference count links
* Cylinder groups
fsck checks intrinsically redundant file-system data. The redundant
data is either read from the file system or computed from known values.
The file system should be in an unmounted state when you check it. The
root file system should be checked only from init run-level s, the system
administrator run-level. (Thus, you can check the root file system
after performing a system shutdown.) Do not run fsck for the root file
system when the system is busy. You can check non-root file systems any
time, but be sure they are unmounted.
You can run fsck interactively or non-interactively. When invoked
without options, fsck runs interactively on file systems marked hfs in
/etc/fstab and queries you for a response when it finds an inconsistency.
In non-interactive mode (typically, in the -p or preen mode), fsck reports
inconsistencies, corrects many problems, but does not remove data. If it
cannot solve a problem, fsck terminates. If this happens, you should run
fsck interactively to fix the problems.
Note:
When running fsck -p before a backup, if the command completes
successfully, perform your backup. If it aborts with errors,
back up the bad file system, repair it, then back up the file
system again.
Do not issue the reboot command in its default form after fsck has
repaired a mounted file system. By default, reboot executes sync
on the disks, thus writing out inconsistent data. If you must reboot,
use reboot -n, which does not issue a sync.
For further discussion of fsck, see the fsck white paper and fsck(1M)
in the HP-UX Reference. The following subsections describe how fsck
interacts with various elements of the file system.
Superblock Consistency
______________________
The superblock's summary information can become inconsistent because
every change to the file system's blocks or inodes modifies it.
Most often, the superblock and its associated parts become corrupted
when the computer is halted and the last command involving output
to the file system is not a reboot, shutdown, sync, or umount command.
fsck checks the superblock for inconsistencies involving:
* Free block count -- this is fairly common
* Free inode count -- this is fairly common
* File system size -- this rarely happens.
If it detects corruption in the static parameters of the primary (default)
superblock, fsck requests the system administrator to specify the location
of an alternate superblock. The alternate superblock addresses are listed
during file-system creation. An alternate superblock is always found at
block number 16. If this superblock is also corrupted, you must supply the
address of another superblock. If the last time you created a file system
was during the installation, a list of superblock addresses can be found
in the /var/adm/sbtab file.
File System Size
----------------
fsck examines the superblock for inconsistencies involving file system
size, number of inodes, free block count, and the free inode count. The
file system size must be larger than the number of blocks used by the
superblock and the number of blocks used by the list of inodes. The file
system size and layout information are critical pieces of information to the
fsck program. While there is no way to actually check these sizes, fsck can
verify that they are within reasonable bounds. All other checks of the file
system depend on the correctness of these sizes.
Free-Block Checking
-------------------
fsck checks that all data associated with files and directories can be found.
The superblock summary information contains a count of the total number of
free blocks within the file system. fsck checks that all the blocks marked
as free are not claimed by any files. When all the blocks have been accounted
for, fsck compares this count to the number of free blocks it finds within
the file system. If the figures do not agree, fsck replaces the count in the
summary information by the actual free-block count. If any of the free-block
maps is erroneous, fsck rebuilds them, excluding all blocks in the list of
allocated blocks.
Inode Checking
--------------
The superblock summary contains a count of the total number of free inodes
within the file system. fsck compares this count to the number of free inodes
it finds within the file system. If the figures do not agree, fsck replaces
the count in the summary information by the actual free inode count.
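The count comparisons described under Free-Block Checking and Inode
Checking can be sketched as follows. This is a simplified model, not
fsck's actual code: the function name and the use of a Python set to
stand for the list of allocated blocks (or inodes) are assumptions made
for illustration.

```python
def reconcile_free_count(summary_count, total, allocated):
    """Compare a superblock summary count with the count actually found.

    summary_count: free count stored in the superblock summary
    total:         total number of blocks (or inodes) in the file system
    allocated:     set of block (or inode) numbers found to be in use
    Returns (actual_free_count, needs_update).
    """
    actual = total - len(allocated)
    return actual, actual != summary_count
```

When needs_update is true, the actual count replaces the stored one.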
Inode Consistency
_________________
Individual inodes are less likely than superblock summary information
to be corrupted. However, because of the great number of active
inodes, it is possible that a few inodes might become corrupted.
The inodes list is checked sequentially, from inode 2 (inode 0 marks unused
directory slots and inode 1 is reserved for future use) to the last inode
in the file system.
The inode structure is defined in the /usr/include/sys/inode.h header
file. There are two major types of inodes: primary and continuation.
Continuation inodes contain only a mode (which is of type continuation),
a link count, and access control list (ACL) entries. Continuation
inodes exist only if a file has optional ACL entries associated with it.
fsck checks the continuation inode's mode, link count, and the reference
from the primary inode. It does not examine the ACL information itself.
fsck checks each primary inode for inconsistencies in the following areas:
* Format and type
* Link count
* Duplicate blocks
* Bad blocks
* Inode size
* Block count
Format and Type
---------------
fsck verifies inode classifications -- regular file, directory,
block special file, character special file, network device, FIFO,
symbolic link, or continuation inode. It also examines the inode
state, which is one of the following:
* Unallocated
* Allocated
* Neither unallocated nor allocated
This last state indicates an incorrectly formatted inode. An inode
can get into this state, for example, if bad data is written into the
inode list through a hardware failure. To correct such an ambiguous
state, fsck clears the defective inode.
Link Count
----------
Contained in each inode is a count of the total number of directory entries
linked to the inode. fsck verifies the link count stored in each inode by
traversing the total directory structure (starting from the root directory)
and calculating an actual link count for each inode. If the stored link
count is non-zero and the actual link count is zero, no directory entry
appears for the inode; fsck links the disconnected file to the /lost+found
directory. If the stored and actual link counts are non-zero and unequal,
a directory entry may have been added or removed without the inode being
updated. fsck replaces the stored link count in the inode by the actual
link count.
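The link-count rules above can be sketched as follows. This is an
illustrative model only: it counts references in every directory passed
to it rather than walking down from the root, and the dictionaries and
the action strings are assumptions of this example.

```python
from collections import Counter

def check_link_counts(stored, directories):
    """Decide what to do with inodes whose link counts disagree.

    stored:      {inode number: link count stored in the inode}
    directories: {directory inode: [inode number of each entry, ...]}
    Returns {inode number: action string} for inconsistent inodes only.
    """
    actual = Counter()
    for entries in directories.values():
        actual.update(entries)          # one link per directory entry

    actions = {}
    for inode, count in stored.items():
        found = actual[inode]
        if count != 0 and found == 0:
            # No directory entry refers to the inode.
            actions[inode] = "link to lost+found"
        elif count != 0 and found != 0 and count != found:
            # Both non-zero but unequal: trust the directory structure.
            actions[inode] = "set link count to %d" % found
    return actions
```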
Duplicate Blocks
----------------
Duplicate blocks can occur from using a file system with blocks claimed
by both the free-block list and other parts of the file system.
Each inode contains a list (or for large files, pointers to lists in
indirect blocks) of all blocks containing its file's data. fsck compares
each block number claimed by an inode to a list of allocated blocks. If
the block number is not yet on the list, fsck adds it. If the block
number is already claimed by another inode, fsck adds the block number
to a list of duplicate blocks.
To resolve duplicate blocks, fsck makes a partial second pass of the
inode list to find the duplicated blocks' inodes. fsck prompts the
operator to clear both inodes. Often clearing only one inode solves the
problem, but the data in the other inode is suspect.
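The bookkeeping behind duplicate-block detection can be sketched as
follows. The data layout (a dictionary from inode number to claimed
block numbers) and the function name are assumptions chosen for
illustration; the sketch records which inodes claim each duplicated
block so that an operator could decide which to clear.

```python
def find_duplicate_blocks(inode_blocks):
    """Build the allocated-block list and the duplicate-block list.

    inode_blocks: {inode number: [block numbers claimed by that inode]}
    Returns (allocated, duplicates), where allocated is the set of all
    claimed blocks and duplicates maps each multiply-claimed block to
    the list of inodes claiming it.
    """
    first_owner = {}
    duplicates = {}
    for inode in sorted(inode_blocks):
        for block in inode_blocks[inode]:
            if block in first_owner:
                # Already claimed: remember every claimant.
                duplicates.setdefault(block, [first_owner[block]]).append(inode)
            else:
                first_owner[block] = inode
    return set(first_owner), duplicates
```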
Bad Blocks
----------
Contained in each inode is a list or pointer to lists of all
the blocks claimed by the inode. fsck checks each block number
claimed by an inode for a value outside the range of the file
system (lower than that of the first data block or greater than the
last block in the file system). If the block number is outside
this range, the block number is a bad block number.
fsck prompts the operator to clear the inode.
LVM provides another mechanism for relocating bad blocks.
(See the Logical Volume Manager documentation.)
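The range test for bad block numbers amounts to a simple comparison, as
in this sketch. The function and parameter names are assumptions for
illustration.

```python
def bad_block_numbers(claimed, first_data_block, last_block):
    """Return the claimed block numbers that lie outside the file system.

    A block number is bad if it is lower than the first data block or
    greater than the last block in the file system.
    """
    return [b for b in claimed if b < first_data_block or b > last_block]
```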
Inode Size
----------
Each inode contains a 64-bit (eight-byte) size field indicating the number
of characters in the file associated with the inode. fsck uses the inode
size field to check for size inconsistencies.
fsck calculates the number of blocks that should be claimed by an inode
by dividing the number of characters in the file by the number of
characters per block and rounding up to get the number of direct blocks.
fsck then counts actual direct and indirect blocks associated with the
inode. If the actual number of blocks does not match the computed number
of blocks, fsck warns of a possible file-size error. This is only a warning
because HP-UX does not fill in blocks in sparse data files.
A directory inode within the HP-UX file system has the mode word set to
"directory". The directory size of a file system using 14-character
filename limits must be a multiple of 32 characters, because a directory
entry contains 32 bytes of information. The number of blocks actually
used for the directory should match that indicated by the inode size.
fsck reports any directory misalignment, but cannot correct it.
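The size checks above reduce to a rounded-up division and a remainder
test. The following sketch is an approximation under stated
assumptions: it counts only data blocks (indirect blocks are ignored),
and the block size is passed in by the caller rather than read from a
superblock.

```python
DIRECTORY_ENTRY_SIZE = 32  # bytes per entry with 14-character file names

def expected_data_blocks(size_in_bytes, block_size):
    """Number of blocks needed to hold size_in_bytes, rounding up."""
    return (size_in_bytes + block_size - 1) // block_size

def possible_size_error(size_in_bytes, block_size, actual_data_blocks):
    """True if the block count does not match the size field.

    This is only a warning condition, because sparse files legitimately
    hold fewer blocks than their size implies.
    """
    return actual_data_blocks != expected_data_blocks(size_in_bytes, block_size)

def directory_size_aligned(size_in_bytes):
    """True if a directory's size is a multiple of the entry size."""
    return size_in_bytes % DIRECTORY_ENTRY_SIZE == 0
```

For example, with an assumed block size of 8192 bytes, a 20000-byte file
is expected to claim three data blocks.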
Block Count
-----------
fsck checks the block count of two types of data blocks:
* Ordinary data blocks containing information stored in a
file. fsck does not attempt to check the validity of the
contents of an ordinary data block.
* Directory data blocks containing directory entries.
Indirect blocks are owned by an inode; thus, inconsistencies in indirect
blocks affect the inode that points to the block. fsck checks indirect blocks
for the following block-count inconsistencies:
* Blocks already claimed by another inode.
* Block numbers outside the range of the file system.
fsck detects and corrects the indirect-block inconsistencies
iteratively, by the same scheme used for direct blocks.
fsck checks each directory data block for inconsistencies involving:
* Directory inode numbers pointing to unallocated inodes.
If a directory entry inode number points to an unallocated
inode, fsck removes the directory entry.
* Directory inode numbers larger than the number of inodes in the
file system.
If a directory entry inode number points beyond the end of the inode list,
fsck removes the directory entry. This occurs if bad data is written into a
directory data block.
* Incorrect directory inode numbers for "." and ".." (current
and parent directories, respectively).
The directory inode number entry for "." should be the first
entry in the directory data block. Its value should be equal to
the inode number for the directory data block.
The directory inode number entry for ".." should be the second
entry in the directory data block. Its value should be equal to
the inode number for the parent of the directory entry (or the
inode number of the directory data block if the directory is
the root directory).
If the directory inode numbers for "." and ".." are incorrect,
fsck replaces them with correct values.
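The directory data block checks listed above can be sketched as one
pass over the entries. The representation of a directory as a list of
(name, inode number) pairs, and the function name, are assumptions made
for illustration.

```python
def check_directory_block(dir_inode, parent_inode, entries,
                          allocated, inode_count):
    """Return a corrected copy of a directory's entries.

    entries:     list of (name, inode number) pairs in on-disk order
    allocated:   set of allocated inode numbers
    inode_count: number of inodes in the file system
    For the root directory, pass parent_inode equal to dir_inode.
    """
    kept = []
    for name, inode in entries:
        if name in (".", ".."):
            continue                    # rebuilt with correct values below
        if inode > inode_count:
            continue                    # points beyond the inode list
        if inode not in allocated:
            continue                    # points to an unallocated inode
        kept.append((name, inode))
    # "." must come first and ".." second.
    return [(".", dir_inode), ("..", parent_inode)] + kept
```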
File-System Connectivity
________________________
fsck checks the general connectivity of the file system. If it finds
directories not linked into the file system, fsck links them back into
the file system by placing them in the /lost+found directory.
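One way to picture the connectivity check is as a walk from the root
directory that marks every directory it can reach. The sketch below is
an assumption-laden model, not fsck's algorithm: directories are given
as a dictionary of child lists, and only the topmost directory of each
disconnected subtree is reported, since its children remain linked
beneath it.

```python
def orphan_directory_roots(root, children, all_directories):
    """Return directories that would need relinking under lost+found.

    root:            inode number of the root directory
    children:        {directory inode: [inodes of its subdirectories]}
    all_directories: every allocated directory inode
    """
    reachable = set()
    stack = [root]
    while stack:
        inode = stack.pop()
        if inode in reachable:
            continue
        reachable.add(inode)
        stack.extend(children.get(inode, []))

    unreachable = set(all_directories) - reachable
    # Drop directories that still hang below another unreachable one.
    nested = {c for d in unreachable for c in children.get(d, [])}
    return sorted(unreachable - nested)
```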
Uncorrectable File System Corruption
____________________________________
In certain instances, fsck may be unable to check and repair the file system
(for example, if all copies of the superblock are lost). The fsdb
(file system debugger) command is provided for such situations.
CAUTION:
fsdb should be used ONLY by an HP-UX file system expert, since it
can easily destroy the entire file system. Refer to the fsdb(1M)
entry in the HP-UX Reference for details.
Transferring Files between HP-UX and Other Systems
==================================================
Not all computers use the HFS File System. To accommodate variances,
HP-UX supports several utilities and services for transferring files,
including to other vendors' operating systems. The following listing shows
what to use when transferring information between HP-UX and various systems.
In some cases (such as with networking products), optional products must
be present.
Utilities and Services for File Transfer
----------------------------------------
kermit:
Use when both systems, connected by serial lines, run kermit. kermit
transfers data between HP-UX and many incompatible operating systems.
For more info:
* Kermit Mailer
* Using C-Kermit, Columbia University, Digital Press
LIF Utilities:
Use when transferring files between HP-UX and systems that support the
LIF file format, including HP-UX Basic, Pascal, and other HP-UX systems.
For more info: lif(4) in the HP-UX Reference
uucp:
Use when the other system is a UNIX system (including HP-UX), connected
by modem lines, direct connection, or X.25 network, and with UUCP
utilities installed. uucp automatically reconciles differences in file
format between systems.
For more info: UUCP chapter of Remote Access: User's Guide
Internet Services:
Use when both systems are connected via LAN to the same services.
The other system can be an HP-UX or UNIX system, or an MS-DOS personal
computer. Internet Services reconcile automatically any differences in
file format between systems.
For more info:
* Installing & Administering Internet Services
* Using Internet Services
* ftp(1) in the HP-UX Reference
HP FTAM/9000:
Use when both systems are networking via OSI and using FTAM (OSI File
Transfer, Access, and Management). OSI is a multi-vendor standard compatible
with UNIX and non-UNIX operating systems. FTAM handles binary and ASCII
file transfers, but does no data conversion.
For more info:
* HP FTAM/9000 Reference Manual
* HP FTAM/9000 Programmer's Guide
* HP FTAM/9000 User's Guide
* Installing and Administering HP FTAM/9000
* FTAM/9000 Technical Addendum
* Release Notes: FTAM 9000
Network Services/9000:
Use when transferring files over LAN to any HP-UX platform.
For more info: Using Network Services
NFS:
Use when users on different HP-UX (and other UNIX) systems want
to share files. Explicit file transfers are unnecessary because
file-system access is transparent.
For more info:
* Installing and Administering NFS Services
cpio:
Use when transferring files by magnetic tape (cartridge or reel-to-reel)
to another UNIX system that supports the cpio format.
NOTE: cpio can be used with the tcio command to
ensure smoother tape access.
For more info: cpio(1) in the HP-UX Reference
ftio:
Use when copying files to magnetic tape (cartridge, reel-to-reel, or DDS
format). Faster throughput than either tar or cpio.
For more info: ftio(1) in the HP-UX Reference
tar:
Use when transferring files by magnetic tape (cartridge, reel-to-reel, DDS
format) to another UNIX system supporting the tar format.
NOTE: tar can be used with the tcio command to ensure smoother tape access.
However, tcio only works on cartridge tapes.
For more info: tar(1) in the HP-UX Reference
tcio:
Use when transferring files between cartridge tape units (including
autochanger) and a controlling HP-UX computer. tcio is typically used with
cpio or tar.
For more info: tcio(1) in the HP-UX Reference
fbackup, frecover:
Use when transferring (typically backing up and restoring) files to
magnetic tape, standard out, DAT tape, rewritable magneto-optical disk,
or to a file. Combines features of dump and ftio.
For more info: fbackup(1M) and frecover(1M) in the HP-UX Reference
File Protection
===============
When created, each file in the file system is assigned a set of file
protections stored in the file permissions bits (often called the file's
mode). The file permission bits determine which classes of users may
read from the file, write to the file, or execute the program stored in
the file. Read, write, and execute permissions for a file can be set
for the file's owner, all members of the file's group (other than the
file's owner), and all other system users.
These three classes of users (user, group, and other) are mutually
exclusive; that is, no member of one class of users is included in any
other class of users. When a file is created, it is associated with an
owner and a group ID. For example, a file created by pjw in group dbase
is listed as being owned by user pjw of group dbase. These values
specify which user owns the file and which group has special access
capability.
The default permissions of a file are initially determined by umask (set
systemwide, in the users' environment file, or on the command line), or
by parameters passed to creat, mknod, or mkdir system calls when the
file is created. The permissions can be changed with the chmod command.
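The effect of umask on the mode requested at creation time is a bitwise
mask: any permission bit set in umask is cleared from the requested
mode. A minimal sketch of the arithmetic (the function name is an
assumption of this example):

```python
def mode_after_umask(requested, umask):
    """Return the permission bits a new file receives.

    requested: mode passed to creat, mknod, or mkdir (for example 0o666)
    umask:     the creating process's file mode creation mask
    """
    return requested & ~umask & 0o7777
```

For example, a requested mode of 666 with a umask of 022 yields 644.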
File permissions are represented as the binary form of four octal digits.
The initial discussion deals with only the three least significant digits.
When the most significant digit is not specified, its value is assumed to
be zero (0).
Organization of File Permission Bits
------------------------------------
                         |     file     |     file     |    others
                         |    owner     |    group     |
          +----+----+----+----+----+----+----+----+----+----+----+----+
   binary |    |    |    |    |    |    |    |    |    |    |    |    |
          +----+----+----+----+----+----+----+----+----+----+----+----+
                           |    |    |    |    |    |    |    |    |
                           |    |   exec  |    |   exec  |    |   exec
                           |  write       |  write       |  write
                          read           read           read
Each group of three binary bits -- one bit to specify read permission,
one bit to specify write permission, and one bit to specify execute
permission, for file owner, group, and others respectively -- is
interpreted as a single octal digit.
If the binary bit value is one, permission is granted for the associated
operation. If the bit value is zero, permission is denied.
Consider a file whose permission bits are set to 754 (octal). Octal 754 is
equivalent to 111 101 100 binary. The ll command represents this as
rwxr-xr--. The file's owner may read, write, and execute the file, while
read and execute permission is granted to members of the file-owner's group.
This includes any user (except the file's owner) whose effective group ID
equals the ID of the file's group, or whose group access list includes the
file's group ID. All other system users may only read the file.
Note that if a file has associated Access Control List (ACL) entries, a
plus sign (+) is displayed following the permissions. By default, the
chmod command deletes
any ACL entries, but you can use the -A option to preserve them. For more
information on ACLs, refer to acl(5) in the HP-UX Reference.
File Permission Bits of rwxr-xr--
---------------------------------
                         |     file     |     file     |    others
                         |    owner     |    group     |
          +----+----+----+----+----+----+----+----+----+----+----+----+
   binary |    |    |    | 1  | 1  | 1  | 1  | 0  | 1  | 1  | 0  | 0  |
          +----+----+----+----+----+----+----+----+----+----+----+----+
                           |    |    |    |    |    |    |    |    |
                           |    |   exec  |    |   exec  |    |   exec
                           |  write       |  write       |  write
                          read           read           read
  as seen using ll:        r    w    x    r    -    x    r    -    -
    octal                 ______7_______ ______5_______ ______4_______
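The 754 example can be reproduced by examining each octal digit in
turn. The sketch below is illustrative only; the function name is an
assumption, and it handles just the three least significant digits.

```python
def permission_string(mode):
    """Render the owner, group, and others digits as ll would show them."""
    out = []
    for shift in (6, 3, 0):             # owner, group, others
        digit = (mode >> shift) & 0o7
        out.append("r" if digit & 4 else "-")
        out.append("w" if digit & 2 else "-")
        out.append("x" if digit & 1 else "-")
    return "".join(out)
```

Calling permission_string(0o754) returns "rwxr-xr--".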
Protecting Directories
______________________
Directories, like all files in the HP-UX file system, have permissions.
The format of a directory's permission bits is identical to that of an
ordinary file; however, the read, write, and execute permissions have a
slightly different meaning when applied to a directory.
* Read permission grants access to display the contents of a
directory.
* Write permission grants access to add a file to the directory,
rename a file within the directory, and remove a file from the
directory. Users (even superusers) cannot write directly
to the directory itself. Only the kernel can write directly
to directories.
* Execute permission grants access to search a directory for a file.
If execute permission is not set, the files below that directory
in the file-system hierarchy cannot be accessed, even when you
supply the file's correct path name.
Setting the sticky bit on a directory provides additional protection to
files within the directory: files cannot be removed from the directory except
by the owner of the file, the owner of the directory, or a user having
appropriate privileges. (See rm(1) in the HP-UX Reference.)
Setting Effective User and Group ID Bits (suid, sgid)
_____________________________________________________
A process has effective user and group IDs that can be used to ensure
file security. Using user and group IDs, a file can be protected so
that when executed, the process's effective IDs are identical to the
file owner's IDs. This capability is specified through the most
significant digit of the four octal file protection digits.
The most significant digit is represented by three bits: set user ID,
set group ID, and sticky bit. These bit values affect the capabilities of
file owner, group, and other.
When the most significant bit of this digit is 1, the effective user ID of
the process executing the file is set equal to the user ID of the file's
owner. This bit is called the set user ID bit (suid or setuid). Similarly,
if the middle bit
of the most significant octal digit is 1, then the effective group ID of the
process executing the file is set equal to the group ID of the file's group.
This bit is called the set group ID bit (sgid or setgid).
If the sgid bit is set for an ordinary file, and the file does not have
group execute permission, the file is in enforcement locking mode.
Refer to the section "File Sharing and Locking" later in this paper, or
to the lockf(2) entry in the HP-UX Reference.
For example, consider a file whose permission bits are octal 6754.
The binary equivalent is 110 111 101 100, as shown below and explained
following the figure. Note that because the set user ID and set
group ID bits are set, the ll listing shows the letter s in the
execute bit of file owner and file group. If the sticky bit had been
set, the execute bit of others would be designated with the letter t.
Permission Bits of an suid/sgid file set to rwsr-sr--
-----------------------------------------------------
                most
            significant  |     file     |     file     |    others
                bits     |    owner     |    group     |
          +----+----+----+----+----+----+----+----+----+----+----+----+
   binary | 1  | 1  | 0  | 1  | 1  | 1  | 1  | 0  | 1  | 1  | 0  | 0  |
          +----+----+----+----+----+----+----+----+----+----+----+----+
            |    |    |    |    |    |    |    |    |    |    |    |
            |    | sticky  |    |   exec  |    |   exec  |    |   exec
            |    |   bit   |  write       |  write       |  write
            |    |        read           read           read
            | set group ID
       set user ID
  as seen
  using ll:                r    w    s    r    -    s    r    -    -
    octal  ______6_______ ______7_______ ______5_______ ______4_______
Explanation of File Permission Bits rwsr-sr--
---------------------------------------------
Most Significant Bits:
Octal digit: 6
Binary form: 110
Permissions:
set user ID: Effective user ID of the process executing this
file is set equal to the user ID of the
file's owner.
set group ID: Effective group ID of the process executing this file
is set equal to the group ID of the file's group.
sticky bit: The sticky bit is not set; see "Protecting
Directories," earlier in this paper.
File Owner Permissions:
Octal digit: 7
Binary form: 111
Permissions:
read: File owner may read the file.
write: File owner may write to the file.
execute: File owner may execute the file.
File Group Permissions:
Octal digit: 5
Binary form: 101
Permissions:
read: Members of the file's group may read the file.
write: Members of the file's group may not write to the file.
execute: Members of the file's group may execute the file.
All Others Permissions:
Octal digit: 4
Binary form: 100
Permissions:
read: Any other user may read the contents of the file.
write: No other users can write to the file.
execute: No other users can execute the file.
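The four-digit example can be checked with the bit masks that Python's
standard stat module provides for the set user ID, set group ID, and
sticky bits. This is an illustrative sketch; the two function names are
assumptions, and stat.filemode is used here only as a stand-in for the
ll display.

```python
import stat

def describe_special_bits(mode):
    """Report which bits of the most significant octal digit are set."""
    return {
        "set user ID": bool(mode & stat.S_ISUID),    # octal 4000
        "set group ID": bool(mode & stat.S_ISGID),   # octal 2000
        "sticky bit": bool(mode & stat.S_ISVTX),     # octal 1000
    }

def ll_permissions(mode):
    """Render all twelve permission bits for a regular file."""
    # filemode() prefixes a file-type character; drop it.
    return stat.filemode(stat.S_IFREG | mode)[1:]
```

For octal 6754, ll_permissions returns "rwsr-sr--", matching the figure.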
Access Control Lists
____________________
Access control lists (ACLs) offer a finer degree of file protection than
traditional file-mode protection bits. With ACLs, you can allow or restrict
file access to individual users, regardless of the group to which the
users belong.
For additional information see acl(5) in the HP-UX Reference.
File Sharing and Locking
========================
In a multi-user, multi-tasking environment such as HP-UX, it is often
desirable to control interaction with files. Many applications share
disk files, and the status of information contained in them could have
serious implications to the user (such as lost or inaccurate information).
Imagine we are responsible for maintaining on-line technical reports for
a myriad of projects, and we have many different people who must have
simultaneous access to these reports. The content of a given report at
a given time could significantly affect a company decision, and so we
want a way to control how records are accessed.
One potential problem could arise if one person (let's call him George)
adds to or modifies information in a report while someone else (Sarah)
is working on it. Sarah is unaware of changes that George has just made
in the report. And once she is done, Sarah overwrites the information
George added. The result is that we have lost ALL of George's
information, and when Sarah added data she was unaware of information
that might have been pertinent.
Advisory Locks
______________
A solution to this problem common to file sharing is called file locking.
In HP-UX, file locking is done with the lockf or fcntl system calls, which
support two modes: advisory locks and enforcement mode. Advisory locks are
placed on disk resources to inform (warn) other processes desiring access
that a file is currently being accessed or modified. Advisory locks are
valuable only for cooperating processes that are both aware of and use
file locking.
In our example, the programs used to access the on-line reports
can use advisory locks. When George begins to work on the Marketing
project his program can call lockf and set an advisory lock. A few
minutes later when Sarah tries to access records in the Marketing
report, she would get an error message indicating that the report is
busy. Her program could then use the lockf system call to wait until
George is done before accessing the report.
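The cooperative pattern in this example might look like the following
sketch, in which a program asks for the lock without waiting and
reports "busy" if another process already holds it. Python's fcntl
module is used for illustration; the function names are assumptions,
and the file descriptor must be open for writing.

```python
import errno
import fcntl
import os

def try_advisory_lock(fd):
    """Try to lock the whole file without waiting.

    Returns True if the lock was obtained, or False if another process
    holds it (the "report is busy" case).
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise
    return True

def release_advisory_lock(fd):
    """Release a lock obtained with try_advisory_lock."""
    fcntl.lockf(fd, fcntl.LOCK_UN)
```

Omitting LOCK_NB would instead make the caller wait until the lock is
released. A program that never calls these functions is not restrained
by the lock at all.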
Enforcement Mode
________________
Even if we use advisory locks in our example, Sarah would still be able
to overwrite the Marketing report if she uses commands or utilities that
do not check for advisory locks. She needs some way to ensure that no
records are written until George finishes accessing the report. HP-UX
does this with enforcement mode. When a process attempts to read or
write to a locked record in a file opened in enforcement mode, the
process sleeps until the record is unlocked. Enforcement mode can be
used only on regular files.
Enforcement mode is enabled by setting the set-group-id bit (sgid) but not
the group execute bit. For example, if we have a file whose permission bits
are set to 644, a long listing of the file would resemble:
-rw-r--r-- 1 george fiscal 512 May 7 16:11 Marketing
To enable enforcement mode, type:
chmod g+s Marketing
This command turns on the sgid bit, resulting in file protection of
2644. Enforcement mode can also be enabled by using the chmod system
call. After enforcement mode is enabled, a long listing shows:
-rw-r-Sr-- 1 george fiscal 512 May 7 16:11 Marketing
Using enforcement mode, George can prevent Sarah from overwriting
his changes, and Sarah would have the data that George has added.
When attempting to access a file locked under enforcement mode, the
process sleeps until the file is released. This provides a means for
one process to control execution of another. Be careful when doing
this, because a system deadlock is possible.
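The permission-bit condition for enforcement mode can be expressed as a
pair of small helpers. This is a sketch under stated assumptions: the
function names are hypothetical, and unlike chmod g+s, the first helper
also clears group execute so that the result always satisfies the
enforcement-mode condition.

```python
import stat

def enable_enforcement_mode(mode):
    """Return mode with the sgid bit set and group execute cleared."""
    return (mode | stat.S_ISGID) & ~stat.S_IXGRP

def is_enforcement_mode(mode):
    """True if sgid is set and group execute is not."""
    return bool(mode & stat.S_ISGID) and not (mode & stat.S_IXGRP)
```

Applied to 644, enable_enforcement_mode returns 2644; the resulting
value could then be passed to the chmod system call.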
Locking Activities
__________________
All file locking is controlled with the lockf or fcntl system
calls. lockf controls four file actions:
* Testing file accessibility by checking to see if a lock by another
process is present on a specific file record.
* Attempting to lock a file. If the record is already locked by
another process, lockf puts the requesting process to sleep until
the record is free again.
* Testing file accessibility, locking the record if it is free, and
returning immediately if it is not.
* Unlocking a record previously locked by the requesting process.
When the locking process either closes the locked file or terminates,
all locks placed by that process are removed. For more details on how
specific locking activities work on HP-UX, refer to lockf(2) and
fcntl(2) in the HP-UX Reference.
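The four lockf actions correspond to the F_TEST, F_LOCK, F_TLOCK, and
F_ULOCK commands, which Python exposes through os.lockf. The following
sketch wraps them for a record given by offset and length; the wrapper
names are assumptions for illustration, and the file descriptor must be
open for writing.

```python
import errno
import os

_BUSY = (errno.EACCES, errno.EAGAIN)

def record_is_free(fd, offset, length):
    """F_TEST: True unless another process has the record locked."""
    os.lseek(fd, offset, os.SEEK_SET)
    try:
        os.lockf(fd, os.F_TEST, length)
    except OSError as exc:
        if exc.errno in _BUSY:
            return False
        raise
    return True

def lock_record(fd, offset, length, wait=False):
    """Lock a record; sleep until free (F_LOCK) or return at once (F_TLOCK)."""
    os.lseek(fd, offset, os.SEEK_SET)
    if wait:
        os.lockf(fd, os.F_LOCK, length)   # sleeps while the record is busy
        return True
    try:
        os.lockf(fd, os.F_TLOCK, length)  # fails immediately if busy
    except OSError as exc:
        if exc.errno in _BUSY:
            return False
        raise
    return True

def unlock_record(fd, offset, length):
    """F_ULOCK: release a record locked by this process."""
    os.lseek(fd, offset, os.SEEK_SET)
    os.lockf(fd, os.F_ULOCK, length)
```

Note that record_is_free reports True for a record locked by the
calling process itself, because the test concerns locks held by other
processes.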