CHAPTER 5 Filesystem-Based Concepts

The UNIX filesystem hierarchy contains a number of different filesystem types including disk-based filesystems such as VxFS and UFS and also pseudo filesystems such as procfs and tmpfs. This chapter describes concepts that relate to filesystems as a whole such as disk partitioning, mounting and unmounting of filesystems, and the main commands that operate on filesystems such as mkfs, mount, fsck, and df.

What's in a Filesystem?

At one time, filesystems were either disk based in which all files in the filesystem were held on a physical disk, or were RAM based. In the latter case, the filesystem only survived until the system was rebooted. However, the concepts and implementation are the same for both. Over the last 10 to 15 years a number of pseudo filesystems have been introduced, which to the user look like filesystems, but for which the implementation is considerably different due to the fact that they have no physical storage. Pseudo filesystems will be presented in more detail in Chapter 11. This chapter is primarily concerned with disk-based filesystems.

A UNIX filesystem is a collection of files and directories that has the following properties:

It has a root directory (/) that contains other files and directories. Most disk-based filesystems will also contain a lost+found directory where orphaned files are stored when recovered following a system crash.

Each file or directory is uniquely identified by its name, the directory in which it resides, and a unique identifier, typically called an inode.

By convention, the root directory has an inode number of 2 and the lost+found directory has an inode number of 3. Inode numbers 0 and 1 are not used. File inode numbers can be seen by specifying the -i option to ls; a short stat()-based example follows this list.

It is self contained. There are no dependencies between one filesystem and any other.

A filesystem must be in a clean state before it can be mounted. If the system crashes, the filesystem is said to be dirty. In this case, operations may have been only partially completed before the crash and therefore the filesystem structure may no longer be intact. In such a case, the filesystem check program fsck must be run on the filesystem to check for any inconsistencies and repair any that it finds. Running fsck returns the filesystem to its clean state. The section Repairing Damaged Filesystems, later in this chapter, describes the fsck program in more detail.
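Inode numbers can also be obtained programmatically through the stat() system call. The following minimal sketch, provided here for illustration only, prints the inode number of each file named on the command line, much as ls -i does:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    struct stat st;
    int i;

    for (i = 1; i < argc; i++) {
        if (stat(argv[i], &st) < 0) {
            perror(argv[i]);
            continue;
        }
        /* st_ino holds the inode number of the file */
        printf("%lu %s\n", (unsigned long)st.st_ino, argv[i]);
    }
    return 0;
}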

The Filesystem Hierarchy

There are many different types of files in a complete UNIX operating system. These files, together with user home directories, are stored in a hierarchical tree structure that allows files of similar types to be grouped together. Although the UNIX directory hierarchy has changed over the years, the structure today still largely reflects the filesystem hierarchy developed for early System V and BSD variants.

For both the root user and normal UNIX users, the PATH shell variable is set up during login to ensure that the appropriate paths are accessible from which to run commands. Because some directories contain commands that are used for administrative purposes, the path for root is typically different from that of normal users. For example, on Linux the paths for the root user and a non-root user may be:

 # echo $PATH
 /usr/sbin:/sbin:/bin:/usr/bin:/usr/local/bin:/usr/bin/X11:/root/bin
 $ echo $PATH
 /home/spate/bin:/usr/bin:/bin:/usr/bin/X11:/usr/local/bin:
 /home/spate/office52/program
 
The following list shows the main UNIX directories and the type of files that reside in each directory. Note that this structure is not strictly followed among the different UNIX variants but there is a great deal of commonality among all of them.

/usr. This is the main location of binaries for both user and administrative purposes.

/usr/bin. This directory contains user binaries.

/usr/sbin. Binaries that are required for system administration purposes are stored here. This directory is not typically on a normal user's path. On some versions of UNIX, some of the system binaries are stored in /sbin.

/usr/local. This directory is used for locally installed software that is typically separate from the OS. The binaries are typically stored in /usr/local/bin.

/usr/share. This directory contains architecture-independent files including ASCII help files. The UNIX manual pages are typically stored in /usr/share/man.

/usr/lib. Dynamic and shared libraries are stored here.

/usr/ucb. For non-BSD systems, this directory contains binaries that originated in BSD.

/usr/include. User header files are stored here. Header files used by the kernel are stored in /usr/include/sys.

/usr/src. The UNIX kernel source code was once held in this directory although this hasn't been the case for a long time, Linux excepted.

/bin. Has been a symlink to /usr/bin for quite some time.

/dev. All of the accessible device files are stored here.

/etc. Holds configuration files and binaries which may need to be run before other filesystems are mounted. This includes many startup scripts and configuration files which are needed when the system bootstraps.

/var. System log files are stored here. Many of the log files are stored in /var/log.

/var/adm. UNIX accounting files and system login files are stored here.

/var/preserve. This directory is used by the vi and ex editors for storing backup files.

/var/tmp. Used for user temporary files.

/var/spool. This directory is used for UNIX commands that provide spooling services such as uucp, printing, and the cron command.

/home. User home directories are typically stored here. This may be /usr/home on some systems. Older versions of UNIX and BSD often store user home directories under /u.

/tmp. This directory is used for temporary files. Files residing in this directory will not necessarily be there after the next reboot.

/opt. Used for optional packages and binaries. Third-party software vendors store their packages in this directory.

When the operating system is installed, there are typically a number of filesystems created. The root filesystem contains the basic set of commands, scripts, configuration files, and utilities that are needed to bootstrap the system. The remaining files are held in separate filesystems that are visible after the system bootstraps and system administrative commands are available. For example, shown below are some of the mounted filesystems for an active Solaris system:

/proc on /proc read/write/setuid
 / on /dev/dsk/c1t0d0s0 read/write/setuid
 /dev/fd on fd read/write/setuid
 /var/tmp on /dev/vx/dsk/sysdg/vartmp read/write/setuid/tmplog
 /tmp on /dev/vx/dsk/sysdg/tmp read/write/setuid/tmplog
 /opt on /dev/vx/dsk/sysdg/opt read/write/setuid/tmplog
 /usr/local on /dev/vx/dsk/sysdg/local read/write/setuid/tmplog
 /var/adm/log on /dev/vx/dsk/sysdg/varlog read/write/setuid/tmplog
 /home on /dev/vx/dsk/homedg/home read/write/setuid/tmplog
 
During installation of the operating system, there is typically a great deal of flexibility allowed so that system administrators can tailor the number and size of filesystems to their specific needs. The basic goal is to separate those filesystems that need to grow from the root filesystem, which must remain stable. If the root filesystem becomes full, the system becomes unusable.

Disks, Slices, Partitions, and Volumes

Each hard disk is typically split into a number of separate, different sized units called partitions or slices. Note that this is not the same as a partition in PC terminology. Each disk contains some form of partition table, called a VTOC (Volume Table Of Contents) in SVR4 terminology, which describes where the slices start and what their size is. Each slice may then be used to store bootstrap information, a filesystem, swap space, or be left as a raw partition for database access or other use.

Disks can be managed using a number of utilities. For example, on Solaris and many SVR4 derivatives, the prtvtoc and fmthard utilities can be used to edit the VTOC to divide the disk into a number of slices. When there are many disks, this hand editing of disk partitions becomes tedious and very error prone.

For example, here is the output of running the prtvtoc command on a root disk on Solaris:

# prtvtoc /dev/rdsk/c0t0d0s0
 * /dev/rdsk/c0t0d0s0 partition map
 *
 * Dimensions:
 * 512 bytes/sector
 * 135 sectors/track
 * 16 tracks/cylinder
 * 2160 sectors/cylinder
 * 3882 cylinders
 * 3880 accessible cylinders
 *
 * Flags:
 * 1: unmountable
 * 10: read-only
 *
 * First Sector Last
 * Partition Tag Flags Sector Count Sector Mount Dir
 0 2 00 0 788400 788399 /
 1 3 01 788400 1049760 1838159
 2 5 00 0 8380800 8380799
 4 0 00 1838160 4194720 6032879 /usr
 6 4 00 6032880 2347920 8380799 /opt
 
The partition tag is used to identify each slice such that c0t0d0s0 is the slice that holds the root filesystem, c0t0d0s4 is the slice that holds the /usr filesystem, and so on.

The following example shows partitioning of an IDE-based, root Linux disk. Although the naming scheme differs, the concepts are similar to those shown previously.

# fdisk /dev/hda
 Command (m for help): p
 Disk /dev/hda: 240 heads, 63 sectors, 2584 cylinders
 Units = cylinders of 15120 * 512 bytes
 Device Boot Start End Blocks Id System
 /dev/hda1 * 1 3 22648+ 83 Linux
 /dev/hda2 556 630 567000 6 FAT16
 /dev/hda3 4 12 68040 82 Linux swap
 /dev/hda4 649 2584 14636160 f Win95 Ext'd (LBA)
 /dev/hda5 1204 2584 10440328+ b Win95 FAT32
 /dev/hda6 649 1203 4195737 83 Linux
 
Logical volume managers provide a much easier way to manage disks and create new slices (called logical volumes). The volume manager takes ownership of the disks and gives out space as requested. Volumes can be simple, in which case the volume simply looks like a basic raw disk slice, or they can be mirrored or striped. For example, the following command can be used with the VERITAS Volume Manager, VxVM, to create a new simple volume:
# vxassist make myvol 10g
 # vxprint myvol
 Disk group: rootdg
 TY NAME ASSOC KSTATE LENGTH PLOFFS STATE
 v myvol fsgen ENABLED 20971520 ACTIVE
 pl myvol-01 myvol ENABLED 20973600 ACTIVE
 sd disk12-01 myvol-01 ENABLED 8378640 0 -
 sd disk02-01 myvol-01 ENABLED 8378640 8378640 -
 sd disk03-01 myvol-01 ENABLED 4216320 16757280 -
 
VxVM created the new volume, called myvol, from existing free space. In this case, the 10GB volume was created from three separate, contiguous chunks of disk space that together can be accessed like a single raw partition.

Raw and Block Devices

Each disk slice or logical volume can be accessed through one of two interfaces: the raw (character) interface or the block interface. The following are examples of character devices:
# ls -l /dev/vx/rdsk/myvol
 crw------ 1 root root 86, 8 Jul 9 21:36 /dev/vx/rdsk/myvol
 # ls -lL /dev/rdsk/c0t0d0s0
 crw------ 1 root sys 136, 0 Apr 20 09:51 /dev/rdsk/c0t0d0s0
 
while the following are examples of block devices:
# ls -l /dev/vx/dsk/myvol
 brw------ 1 root root 86, 8 Jul 9 21:11 /dev/vx/dsk/myvol
 # ls -lL /dev/dsk/c0t0d0s0
 brw------ 1 root sys 136, 0 Apr 20 09:51 /dev/dsk/c0t0d0s0
 
Note that both can be distinguished by the first character displayed (b or c) or through the location of the device file. Typically, raw devices are accessed through /dev/rdsk while block devices are accessed through /dev/dsk. When accessing the block device, data is read and written through the system buffer cache. Although the buffers that describe these data blocks are freed once used, they remain in the buffer cache until they get reused. Data accessed through the raw or character interface is not read through the buffer cache. Thus, mixing the two can result in stale data in the buffer cache, which can cause problems.

All filesystem commands, with the exception of the mount command, should therefore use the raw/character interface to avoid this potential caching problem.
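The distinction can also be made programmatically: the st_mode field filled in by stat() records whether a file is a block or a character special file. The following minimal sketch, illustrative only, reports the type of each device path passed on the command line:

#include <sys/types.h>
#include <sys/stat.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    struct stat st;
    int i;

    for (i = 1; i < argc; i++) {
        if (stat(argv[i], &st) < 0) {
            perror(argv[i]);
            continue;
        }
        if (S_ISBLK(st.st_mode))
            printf("%s: block special (I/O goes through the buffer cache)\n", argv[i]);
        else if (S_ISCHR(st.st_mode))
            printf("%s: character special (raw access)\n", argv[i]);
        else
            printf("%s: not a device file\n", argv[i]);
    }
    return 0;
}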

Filesystem Switchout Commands

Many of the commands that apply to filesystems may require filesystem specific processing. For example, when creating a new filesystem, each different filesystem may support a wide range of options. Although some of these options will be common to most filesystems, many may not be.

To support a variety of command options, many of the filesystem-related commands are divided into generic and filesystem-dependent components. For example, the generic mkfs command, which will be described in the next section, is invoked as follows:

# mkfs -F vxfs -o ...

The -F option (-t on Linux) is used to specify the filesystem type. The -o option is used to specify filesystem-specific options. The first task to be performed by mkfs is to do a preliminary sanity check on the arguments passed. After this has been done, the next job is to locate and call the filesystem specific mkfs function.

Take for example the call to mkfs as follows:

# mkfs -F nofs /dev/vx/rdsk/myvol
 mkfs: FSType nofs not installed in the kernel
 
Because there is no filesystem type of nofs, the generic mkfs command is unable to locate the nofs version of mkfs. To see how the search is made for the filesystem specific mkfs command, consider the following:
# truss -o /tmp/truss.out mkfs -F nofs /dev/vx/rdsk/myvol
 mkfs: FSType nofs not installed in the kernel
 # grep nofs /tmp/truss.out
 execve("/usr/lib/fs/nofs/mkfs", 0x000225C0, 0xFFBEFDA8) Err#2 ENOENT
 execve("/etc/fs/nofs/mkfs", 0x000225C0, 0xFFBEFDA8) Err#2 ENOENT
 sysfs(GETFSIND, "nofs") Err#22 EINVAL
 
The generic mkfs command assumes that commands for the nofs filesystem will be located in one of the two directories shown above. In this case, the files don't exist. As a final sanity check, a call is made to sysfs() to see if there actually is a filesystem type called nofs.

Consider the location of the generic and filesystem-specific fstyp commands in Solaris:

# which fstyp
 /usr/sbin/fstyp
 # ls /usr/lib/fs
 autofs/ fd/ lofs/ nfs/ proc/ udfs/ vxfs/
 cachefs/ hsfs/ mntfs/ pcfs/ tmpfs/ ufs/
 # ls /usr/lib/fs/ufs/fstyp
 /usr/lib/fs/ufs/fstyp
 # ls /usr/lib/fs/vxfs/fstyp
 /usr/lib/fs/vxfs/fstyp
 
Using this knowledge it is very straightforward to write a version of the generic fstyp command as follows:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/fstyp.h>
#include <sys/fsid.h>

main(int argc, char **argv)
{
    char cmd[256];

    if (argc != 4 || strcmp(argv[1], "-F") != 0) {
        printf("usage: myfstyp -F fs-type special\n");
        exit(1);
    }
    sprintf(cmd, "/usr/lib/fs/%s/fstyp", argv[2]);
    if (execl(cmd, argv[2], argv[3], NULL) < 0) {
        printf("Failed to find fstyp command for %s\n",
               argv[2]);
    }
    if (sysfs(GETFSIND, argv[2]) < 0) {
        printf("Filesystem type \"%s\" doesn't exist\n",
               argv[2]);
    }
}
This version requires that the filesystem type to search for is specified. If it is located in the appropriate place, the command is executed. If not, a check is made to see if the filesystem type exists as the following run of the program shows:
# myfstyp -F vxfs /dev/vx/rdsk/myvol
 vxfs
 # myfstyp -F nofs /dev/vx/rdsk/myvol
 Failed to find fstyp command for nofs
 Filesystem type "nofs" doesn't exist
 

Creating New Filesystems

Filesystems can be created on raw partitions or logical volumes. For example, in the prtvtoc output shown above, the root (/) filesystem was created on the raw disk slice /dev/rdsk/c0t0d0s0 and the /usr filesystem was created on the raw disk slice /dev/rdsk/c0t0d0s4.

The mkfs command is most commonly used to create a new filesystem, although on some platforms the newfs command provides a more friendly interface and calls mkfs internally. The type of filesystem to create is passed to mkfs as an argument. For example, to create a VxFS filesystem, this would be achieved by invoking mkfs -F vxfs on most UNIX platforms. On Linux, the call would be mkfs -t vxfs.

The filesystem type is passed as an argument to the generic mkfs command (-F or -t). This is then used to locate the switchout command by searching well-known locations as shown above. The following two examples show how to create a VxFS filesystem. In the first example, the size of the filesystem to create is passed as an argument. In the second example, the size is omitted, in which case VxFS determines the size of the device and creates a filesystem of that size.

# mkfs -F vxfs /dev/vx/rdsk/vol1 25g
 version 4 layout
 52428800 sectors, 6553600 blocks of size 4096, log size 256 blocks
 unlimited inodes, largefiles not supported
 6553600 data blocks, 6552864 free data blocks
 200 allocation units of 32768 blocks, 32768 data blocks
 # mkfs -F vxfs /dev/vx/rdsk/vol1
 version 4 layout
 54525952 sectors, 6815744 blocks of size 4096, log size 256 blocks
 unlimited inodes, largefiles not supported
 6815744 data blocks, 6814992 free data blocks
 208 allocation units of 32768 blocks, 32768 data blocks
 
The following example shows how to create a UFS filesystem. Note that although the output is different, the method of invoking mkfs is similar for both VxFS and UFS.
# mkfs -F ufs /dev/vx/rdsk/vol1 54525952
 /dev/vx/rdsk/vol1: 54525952 sectors in 106496 cylinders of
 16 tracks, 32 sectors
 26624.0MB in 6656 cyl groups (16 c/g, 4.00MB/g, 1920 i/g)
 super-block backups (for fsck -F ufs -o b=#) at:
 32, 8256, 16480, 24704, 32928, 41152, 49376, 57600, 65824,
 74048, 82272, 90496, 98720, 106944, 115168, 123392, 131104,
 139328, 147552, 155776, 164000,
 ...
 54419584, 54427808, 54436032, 54444256, 54452480, 54460704,
 54468928, 54477152, 54485376, 54493600, 54501824, 54510048,
 
The time taken to create a filesystem differs from one filesystem type to another. This is due to how the filesystems lay out their structures on disk. In the example above, it took UFS 23 minutes to create a 25GB filesystem, while for VxFS it took only half a second. Chapter 9 describes the implementation of various filesystems and shows how this large difference in filesystem creation time can occur.

Additional arguments can be passed to mkfs through use of the -o option, for example:

# mkfs -F vxfs -obsize=8192,largefiles /dev/vx/rdsk/myvol
 version 4 layout
 20971520 sectors, 1310720 blocks of size 8192,
 log size 128 blocks
 unlimited inodes, largefiles not supported
 1310720 data blocks, 1310512 free data blocks
 40 allocation units of 32768 blocks, 32768 data blocks
 
For arguments specified using the -o option, the generic mkfs command will pass the arguments through to the filesystem specific mkfs command without trying to interpret them.

Mounting and Unmounting Filesystems

The root filesystem is mounted by the kernel during system startup. Each filesystem can be mounted on any directory in the root filesystem, except /. A mount point is simply a directory. When a filesystem is mounted on that directory, the previous contents of the directory are hidden for the duration of the mount, as shown in Figure 5.1.

In order to mount a filesystem, the filesystem type, the device (slice or logical volume), and the mount point must be passed to the mount command. In the example below, a VxFS filesystem is mounted on /mnt1. Running the mount command by itself shows all the filesystems that are currently mounted, along with their mount options:

# mount -F vxfs /dev/vx/dsk/vol1 /mnt1
 # mount | grep mnt1
 /mnt1 on /dev/vx/dsk/vol1 read/write/setuid/delaylog/
 nolargefiles/ioerror=mwdisable/dev=1580006
 on Tue Jul 3 09:40:27 2002
 
Note that the mount shows default mount options as well as options that were explicitly requested. On Linux, the -t option is used to specify the filesystem type so the command would be invoked with mount -t vxfs.

As with mkfs, the mount command is a switchout command. The generic mount runs first and locates the filesystem-specific command to run, as the following output shows. Note the use of the access() system call. There are a number of well-known locations for which the filesystem-dependent mount command can be located.

1379: execve("/usr/sbin/mount", 0xFFBEFD8C, 0xFFBEFDA4) argc = 5
 ...
 1379: access("/usr/lib/fs/vxfs/mount", 0) Err#2 ENOENT
 1379: execve("/etc/fs/vxfs/mount", 0xFFBEFCEC, 0xFFBEFDA4) argc = 3
 ...
 1379: mount("/dev/vx/dsk/vol1", "/mnt1", MS_DATA|MS_OPTIONSTR,
 "vxfs", 0xFFBEFBF4, 12) = 0
 ...
 
When a filesystem is mounted, an entry is added to the mount table, which is a file held in /etc that records all filesystems mounted, the devices on which they reside, the mount points on which they're mounted, and a list of options that were passed to mount or which the filesystem chose as defaults.
The actual name chosen for the mount table differs across different versions of UNIX. On all System V variants, it is called mnttab, while on Linux and BSD variants it is called mtab.

Shown below are the first few lines of /etc/mnttab on Solaris followed by the contents of a /etc/mtab on Linux:

# head -6 /etc/mnttab
 /proc /proc proc rw,suid,dev=2f80000 995582515
 /dev/dsk/c1t0d0s0 / ufs rw,suid,dev=1d80000,largefiles 995582515
 fd /dev/fd fd rw,suid,dev=3080000 995582515
 /dev/dsk/c1t1d0s0 /space1 ufs ro,largefiles,dev=1d80018 995582760
 /dev/dsk/c1t2d0s0 /rootcopy ufs ro,largefiles,dev=1d80010
 995582760
 /dev/vx/dsk/sysdg/vartmp /var/tmp vxfs rw,tmplog,suid,nolargefiles
 995582793
 # cat /etc/mtab
 /dev/hda6 / ext2 rw 0 0
 none /proc proc rw 0 0
 usbdevfs /proc/bus/usb usbdevfs rw 0 0
 /dev/hda1 /boot ext2 rw 0 0
 none /dev/pts devpts rw,gid=5,mode=620 0 0
 
All versions of UNIX provide a set of routines for manipulating the mount table, either for adding entries, removing entries, or simply reading them. Listed below are two of the functions that are most commonly available:
#include <stdio.h>
 #include <sys/mnttab.h>

 int getmntent(FILE *fp, struct mnttab *mp);
 int putmntent(FILE *iop, struct mnttab *mp);
 
The getmntent(L) function is used to read entries from the mount table while putmntent(L) can be used to remove entries. Both functions operate on the mnttab structure, which will contain at least the following members:
char *mnt_special; /* The device on which the fs resides */
 char *mnt_mountp; /* The mount point */
 char *mnt_fstype; /* The filesystem type */
 char *mnt_mntopts; /* Mount options */
 char *mnt_time; /* The time of the mount */
 
Using the getmntent(L) library routine, it is very straightforward to write a simple version of the mount command that, when run with no arguments, displays the mounted filesystems by reading entries from the mount table. The program, which is shown below, simply involves opening the mount table and then making repeated calls to getmntent(L) to read all entries.
#include <stdio.h>
#include <sys/mnttab.h>

main()
{
    struct mnttab mt;
    FILE *fp;

    fp = fopen("/etc/mnttab", "r");

    printf("%-15s%-10s%-30s\n",
           "mount point", "fstype", "device");
    while ((getmntent(fp, &mt)) != -1) {
        printf("%-15s%-10s%-30s\n", mt.mnt_mountp,
               mt.mnt_fstype, mt.mnt_special);
    }
}
 
Each time getmntent(L) is called, it returns the next entry in the file. Once all entries have been read, -1 is returned. Here is an example of the program running:
$ mymount | head -7
 /proc proc /proc
 / ufs /dev/dsk/c1t0d0s0
 /dev/fd fd fd
 /space1 ufs /dev/dsk/c1t1d0s0
 /var/tmp vxfs /dev/vx/dsk/sysdg/vartmp
 /tmp vxfs /dev/vx/dsk/sysdg/tmp
 
On Linux, the format of the mount table is slightly different and the getmntent(L) function operates on a mntent structure. Other than minor differences with field names, the following program is almost identical to the one shown above:
#include <stdio.h>
#include <mntent.h>

main()
{
    struct mntent *mt;
    FILE *fp;

    fp = fopen("/etc/mtab", "r");

    printf("%-15s%-10s%-30s\n",
           "mount point", "fstype", "device");
    while ((mt = getmntent(fp)) != NULL) {
        printf("%-15s%-10s%-30s\n", mt->mnt_dir,
               mt->mnt_type, mt->mnt_fsname);
    }
}
Following is the output when the program runs:
$ lmount
 mount point fstype device
 / ext2 /dev/hda6
 /proc proc none
 /proc/bus/usb usbdevfs usbdevfs
 /boot ext2 /dev/hda1
 /dev/pts devpts none
 /mnt1 vxfs /dev/vx/dsk/myvol
To unmount a filesystem either the mount point or the device can be passed to the umount command, as the following examples show:
# umount /mnt1
 # mount | grep mnt1
 # mount -F vxfs /dev/vx/dsk/vol1 /mnt1
 # mount | grep mnt1
 /mnt1 on /dev/vx/dsk/vol1 read/write/setuid/delaylog/ ...
 # umount /dev/vx/dsk/vol1
 # mount | grep mnt1
After each invocation of umount, the entry is removed from the mount table.

Mount and Umount System Call Handling

As the preceding examples showed, the mount and umount commands result in a call to the mount() and umount() system calls respectively.
#include <sys/types.h>
 #include <sys/mount.h>

 int mount(const char *spec, const char *dir, int mflag,
           /* char *fstype, const char *dataptr, int datalen */ ...);

 #include <sys/mount.h>

 int umount(const char *file);
There should usually be no need to invoke either the mount() or umount() system call directly. Although many of the arguments are self explanatory, the handling of per-filesystem options, as pointed to by dataptr, is not typically published and often changes. If applications have a need to mount and unmount filesystems, the system(L) library function is recommended as a better choice.
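For completeness, the sketch below shows the approach recommended above: it hands fixed mount and umount command lines to system() rather than calling mount() directly. The VxFS device and /mnt1 mount point are the ones used in the earlier examples and are purely illustrative:

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
    /*
     * Build command lines for the mount and umount commands and let
     * them handle any filesystem-specific options; the device and
     * mount point here are purely illustrative.
     */
    if (system("mount -F vxfs /dev/vx/dsk/vol1 /mnt1") != 0) {
        fprintf(stderr, "mount failed\n");
        exit(1);
    }

    /* ... work with files under /mnt1 ... */

    if (system("umount /mnt1") != 0)
        fprintf(stderr, "umount failed\n");
    return 0;
}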

Mounting Filesystems Automatically

As shown in the next section, after filesystems are created, it is typically left to the system to mount them during bootstrap. The virtual filesystem table, called /etc/vfstab on System V variants and /etc/fstab on BSD variants, contains all the necessary information about each filesystem to be mounted.

This file is partially created during installation of the operating system. When new filesystems are created, the system administrator will add new entries ensuring that all the appropriate fields are entered correctly. Shown below is an example of the vfstab file on Solaris:

# cat /etc/vfstab
 ...
 fd - /dev/fd fd - no -
 /proc - /proc proc - no -
 /dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
 /dev/dsk/c0t0d0s6 /dev/rdsk/c0t0d0s6 /usr ufs 1 no -
 /dev/dsk/c0t0d0s4 /dev/rdsk/c0t0d0s4 /c ufs 2 yes -
 ...
 
Here the fields are separated by spaces or tabs. The first field shows the block device (passed to mount), the second field shows the raw device (passed to fsck), the third field specifies the mount point, and the fourth specifies the filesystem type. The remaining three fields specify the order in which the filesystems will be checked, whether they should be mounted during bootstrap, and what options should be passed to the mount command.

Here is an example of a Linux fstab table:

 # cat /etc/fstab
 LABEL=/ / ext2 defaults 1 1
 LABEL=/boot /boot ext2 defaults 1 2
 /dev/cdrom /mnt/cdrom iso9660 noauto,owner,ro 0 0
 /dev/fd0 /mnt/floppy auto noauto,owner 0 0
 none /proc proc defaults 0 0
 none /dev/pts devpts gid=5,mode=620 0 0
 /dev/hda3 swap swap defaults 0 0
 /SWAP swap swap defaults 0 0
The first four fields describe the device, mount point, filesystem type, and options to be passed to mount. The fifth field is related to the dump command and records which filesystems need to be backed up. The sixth field is used by the fsck program to determine the order in which filesystems should be checked during bootstrap.
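Because Linux exposes the fstab format through the same <mntent.h> interface used earlier for /etc/mtab, the table can also be read programmatically. The following minimal sketch uses setmntent() and getmntent() to print the fields described above; the output format is arbitrary:

#include <stdio.h>
#include <mntent.h>

int
main(void)
{
    FILE *fp;
    struct mntent *mnt;

    /* setmntent()/endmntent() open and close the table file */
    fp = setmntent("/etc/fstab", "r");
    if (fp == NULL) {
        perror("/etc/fstab");
        return 1;
    }
    while ((mnt = getmntent(fp)) != NULL) {
        printf("%-20s %-15s %-8s %-20s dump=%d pass=%d\n",
            mnt->mnt_fsname, mnt->mnt_dir, mnt->mnt_type,
            mnt->mnt_opts, mnt->mnt_freq, mnt->mnt_passno);
    }
    endmntent(fp);
    return 0;
}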

Mounting Filesystems During Bootstrap

Once filesystems are created and entries placed in /etc/vfstab, or equivalent, there is seldom need for administrator intervention. This file is accessed during system startup to mount all filesystems before the system is accessible to most applications and users.

When the operating system bootstraps, the kernel is read from a well-known location of disk and then goes through basic initialization tasks. One of these tasks is to mount the root filesystem. This is typically the only filesystem that is mounted until the system rc scripts start running.

The init program is spawned by the kernel as the first process (process ID of 1). By consulting the inittab(F) file, it determines which commands and scripts it needs to run to bring the system up further. This sequence of events can differ between one system and another. For System V-based systems, the rc scripts are located in /etc/rcX.d where X corresponds to the run level at which init is running.

Following are a few lines from the inittab(F) file:

$ head -9 inittab
 ap::sysinit:/sbin/autopush -f /etc/iu.ap
 ap::sysinit:/sbin/soconfig -f /etc/sock2path
 fs::sysinit:/sbin/rcS sysinit
 is:3:initdefault:
 p3:s1234:powerfail:/usr/sbin/shutdown -y -i5 -g0
 sS:s:wait:/sbin/rcS
 s0:0:wait:/sbin/rc0
 s1:1:respawn:/sbin/rc1
 s2:23:wait:/sbin/rc2
Of particular interest is the last line. The system goes multiuser at init state 2, which is achieved by running the rc2 script; this in turn runs all of the scripts found in /etc/rc2.d. One of these scripts, S01MOUNTFSYS, is responsible for ensuring that all filesystems are checked for consistency and mounted as appropriate. The mountall script is responsible for actually mounting all of the filesystems.

The layout of files and scripts used on non-System V variants differs, but the concepts are the same.

Repairing Damaged Filesystems

A filesystem can typically be in one of two states, either clean or dirty. To mount a filesystem it must be clean, which means that it is structurally intact. When filesystems are mounted read/write, they are marked dirty to indicate that there is activity on the filesystem. Operations may be pending on the filesystem during a system crash, which could leave the filesystem with structural damage. In this case it can be dangerous to mount the filesystem without knowing the extent of the damage. Thus, to return the filesystem to a clean state, a filesystem-specific check program called fsck must be run to repair any damage that might exist.

For example, consider the following call to mount after a system crash:

# mount -F vxfs /dev/vx/dsk/vol1 /mnt1
UX:vxfs mount: ERROR: /dev/vx/dsk/vol1 is corrupted. needs checking

The filesystem is marked dirty and therefore the mount fails. Before it can be mounted again, the VxFS fsck program must be run as follows:

# fsck -F vxfs /dev/vx/rdsk/vol1
 log replay in progress
 replay complete - marking super-block as CLEAN
VxFS is a transaction-based filesystem in which structural changes made to the filesystem are first written to the filesystem log. By replaying the transactions in the log, the filesystem returns to its clean state.

Most UNIX filesystems are not transaction-based, and therefore the whole filesystem must be checked for consistency. In the example below, a full fsck is performed on a UFS filesystem to show the type of checks that will be performed. UFS on most versions of UNIX is not transaction-based although Sun has added journaling support to its version of UFS.

# fsck -F ufs -y /dev/vx/rdsk/myvol
 ** /dev/vx/dsk/myvol
 ** Last Mounted on /mnt1
 ** Phase 1 Check Blocks and Sizes
 ** Phase 2 Check Pathnames
 ** Phase 3 Check Connectivity
 ** Phase 4 Check Reference Counts
 ** Phase 5 Check Cyl groups
 61 files, 13 used, 468449 free (41 frags, 58551 blocks, 0.0% fragmentation)
Running fsck is typically a non-interactive task performed during system initialization. Interacting with fsck is not something that system administrators will typically need to do. Recording the output of fsck is always a good idea in case fsck fails to clean the filesystem and support is needed by filesystem vendors and/or developers.

The Filesystem Debugger

When things go wrong with filesystems, it is necessary to debug them in the same way that it is necessary to debug other applications. Most UNIX filesystems have shipped with the filesystem debugger, fsdb, which can be used for that purpose.

It is with good reason that fsdb is one of the least commonly used of the UNIX commands. In order to use fsdb effectively, knowledge of the filesystem structure on disk is vital, as well as knowledge of how to use the filesystem specific version of fsdb. Note that one version of fsdb does not necessarily bear any resemblance to another.

In general, fsdb should be left well alone. Because it is possible to damage the filesystem beyond repair, its use should be left for filesystem developers and support engineers only.

Per Filesystem Statistics

In the same way that the stat() system call can be called to obtain per-file related information, the statvfs() system call can be invoked to obtain per-filesystem information. Note that this information will differ for each different mounted filesystem so that the information obtained for, say, one VxFS filesystem, will not necessarily be the same for other VxFS filesystems.
#include <sys/types.h>
 #include <sys/statvfs.h>

 int statvfs(const char *path, struct statvfs *buf);
 int fstatvfs(int fildes, struct statvfs *buf);
Both functions operate on the statvfs structure, which contains a number of filesystem-specific fields including the following:
u_long     f_bsize;            /* file system block size */
 u_long     f_frsize;           /* fundamental filesystem block size
                                   (if supported) */
 fsblkcnt_t f_blocks;           /* total # of blocks on file system
                                   in units of f_frsize */
 fsblkcnt_t f_bfree;            /* total # of free blocks */
 fsblkcnt_t f_bavail;           /* # of free blocks avail to
                                   non-super-user */
 fsfilcnt_t f_files;            /* total # of file nodes (inodes) */
 fsfilcnt_t f_ffree;            /* total # of free file nodes */
 fsfilcnt_t f_favail;           /* # of inodes avail to non-suser */
 u_long     f_fsid;             /* file system id (dev for now) */
 char       f_basetype[FSTYPSZ]; /* fs name null-terminated */
 u_long     f_flag;             /* bit mask of flags */
 u_long     f_namemax;          /* maximum file name length */
 char       f_fstr[32];         /* file system specific string */
The statvfs(L) function is not available on Linux. In its place is the statfs(L) function, which operates on the statfs structure. The fields of this structure are very similar to those of the statvfs structure, and therefore implementing commands such as df requires very little modification if written for a system complying with the Single UNIX Specification.

The following program provides a simple implementation of the df command by invoking statvfs(L) to obtain per-filesystem statistics as well as locating the corresponding entry in the /etc/mnttab file:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/statvfs.h>
#include <sys/mnttab.h>

#define Kb (stv.f_frsize / 1024)

main(int argc, char **argv)
{
    struct mnttab mt, mtp;
    struct statvfs stv;
    int blocks, used, avail, capacity;
    FILE *fp;

    statvfs(argv[1], &stv);

    fp = fopen("/etc/mnttab", "r");
    memset(&mtp, 0, sizeof(struct mnttab));
    mtp.mnt_mountp = argv[1];
    getmntany(fp, &mt, &mtp);

    blocks = stv.f_blocks * Kb;
    used = (stv.f_blocks - stv.f_bfree) * Kb;
    avail = stv.f_bfree * Kb;
    capacity = ((double)used / (double)blocks) * 100;
    printf("Filesystem kbytes used "
           "avail capacity Mounted on\n");
    printf("%-22s%-7d%8d%8d %2d%% %s\n",
           mt.mnt_special, blocks, used, avail,
           capacity, argv[1]);
}
In the output shown next, the df command is run first followed by output from the example program:
 $ df -k /h
 Filesystem kbytes used avail capacity Mounted on
 /dev/vx/dsk/homedg/h 7145728 5926881 1200824 84% /h
 $ mydf /h
 Filesystem kbytes used avail capacity Mounted on
 /dev/vx/dsk/homedg/h 7145728 5926881 1218847 82% /h
 
In practice, there is a lot of formatting work needed by df due to the different sizes of device names, mount paths, and the additional information displayed about each filesystem.

Note that the preceding program has no error checking. As an exercise, enhance the program to add error checking. On Linux the program needs modification to access the /etc/mtab file and to use the statfs(L) function. The program can be enhanced further to display all entries on the mount table as well as accept some of the other options that df provides.
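As a rough illustration of the Linux changes just mentioned, the sketch below substitutes statfs() and /etc/mtab for statvfs() and /etc/mnttab. It mirrors the simplistic capacity calculation of the Solaris version and, like it, performs almost no error checking, so it should be treated as a starting point rather than a finished df:

#include <stdio.h>
#include <string.h>
#include <mntent.h>
#include <sys/vfs.h>

int
main(int argc, char **argv)
{
    struct statfs st;
    struct mntent *mnt;
    FILE *fp;
    long long kb, blocks, used, avail;

    if (argc != 2 || statfs(argv[1], &st) < 0) {
        fprintf(stderr, "usage: mydf mount-point\n");
        return 1;
    }
    kb = st.f_bsize / 1024;    /* assumes a block size of 1KB or more */
    blocks = (long long)st.f_blocks * kb;
    used = (long long)(st.f_blocks - st.f_bfree) * kb;
    avail = (long long)st.f_bavail * kb;

    /* Find the device name by matching the mount point in /etc/mtab */
    fp = setmntent("/etc/mtab", "r");
    while (fp != NULL && (mnt = getmntent(fp)) != NULL) {
        if (strcmp(mnt->mnt_dir, argv[1]) != 0)
            continue;
        printf("Filesystem            kbytes    used   avail capacity Mounted on\n");
        printf("%-20s%8lld%8lld%8lld   %3lld%%   %s\n",
            mnt->mnt_fsname, blocks, used, avail,
            blocks ? (used * 100) / blocks : 0, argv[1]);
        break;
    }
    if (fp != NULL)
        endmntent(fp);
    return 0;
}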

User and Group Quotas

Although there may be multiple users of a filesystem, it is possible for a single user to consume all of the space within the filesystem. User and group quotas provide the mechanisms by which the amount of space used by a single user or all users within a specific group can be limited to a value defined by the administrator.

Quotas are based on the number of files used and the number of blocks used. Some filesystems have a limited number of inodes available. Even though the amount of space consumed by a user may be small, it is still possible for that user to consume all of the inodes in the filesystem even though most of the free space is still available.

Quotas operate around two limits that allow the user to take some action if the number of files or disk blocks used starts to exceed the administrator-defined limits:

Soft Limit. If the user exceeds the limit defined, there is a grace period that allows the user to free up some space. The quota can be exceeded during this time. However, after the time period has expired, no more files or data blocks may be allocated.

Hard Limit. When the hard limit is reached, regardless of the grace period, no further files or blocks can be allocated.

The grace period is set on a per-filesystem basis. For the VxFS filesystem, the default is seven days. The soft limit allows for users running applications that may create a lot of temporary files that only exist for the duration of the application. If the soft limit is exceeded, no action is taken. After the application exits, the temporary files are removed, and the amount of files and/or disk blocks goes back under the soft limit once more. Another circumstance when the soft limit is exceeded occurs when allocating space to a file. If files are written to sequentially, some filesystems, such as VxFS, allocate large extents (contiguous data blocks) to try to keep file data in one place. When the file is closed, the portion of the extent unused is freed.

In order for user quotas to work, there must be a file called quotas in the root directory of the filesystem. Similarly, for group quotas, the quotas.grp file must be present. Both of these files are used by the administrator to set quota limits for users and/or groups. If both user and group quotas are used, the amount of space allocated to a user is the lower of the two limits.

There are a number of commands to administer quotas. Those shown here are provided by VxFS. UFS provides a similar set of commands. Each command can take a -u or -g option to administer user and group quotas respectively.

vxedquota. This command can be used to edit the quota limits for users and groups.

vxrepquota. This command provides a summary of the quota limits together with disk usage.

vxquot. This command displays file ownership and usage summaries.

vxquota. This command can be used to view quota limits and usage.

vxquotaon. This command turns on quotas for a specified VxFS filesystem.

vxquotaoff. This command turns off quotas for the specified filesystem.

Quota checks are performed when the filesystem is mounted. This involves reading all inodes on disk and calculating usage for each user and group if needed.

Summary

This chapter described the main concepts applicable to filesystems as a whole, how they are created and mounted, and how they are repaired if damaged by a system crash or other means. Although the format of some of the mount tables differs between one system and the next, the location of the files differs only slightly, and the principles apply across all systems. In general, unless administering a UNIX-based machine, many of the commands described here will not be used by the average UNIX user. However, having a view of how filesystems are managed helps gain a much better understanding of filesystems overall.

CHAPTER 6 UNIX Kernel Concepts

This chapter covers the earlier versions of UNIX up to 7th Edition and describes the main kernel concepts, with particular reference to the kernel structures related to filesystem activity and how the main file access-based system calls were implemented.

The structures, kernel subsystems, and flow of control through the research edition UNIX kernels are still largely intact after more than 25 years of development. Thus, the simple approaches described in this chapter are definitely a prerequisite to understanding the more complex UNIX implementations found today.

5th to 7th Edition Internals

From the mid 1980s onwards, there have been a number of changes in the UNIX kernel that resulted in the mainstream kernels diverging in their implementation. For the first fifteen years of UNIX development, there wasn't a huge difference in the way many kernel subsystems were implemented, and therefore understanding the principles behind these earlier UNIX versions will help readers understand how the newer kernels have changed.

The earliest documented version of UNIX was 6th Edition, which can be seen in John Lions' book Lions' Commentary on UNIX 6th Edition with Source Code [LION96]. It is now also possible to download free versions of UNIX from 5th Edition onwards. The kernel source base is very small by today's standards. With less than 8,000 lines of code for the whole kernel, it is easily possible to gain an excellent understanding of how the kernel worked. Even the small amounts of assembler code do not need significant study to determine their operation.

This chapter concentrates on kernel principles from a filesystem perspective. Before describing the newer UNIX implementations, it is first necessary to explain some fundamental UNIX concepts. Much of the description here centers around the period covering 5th to 7th Edition UNIX, which generally covers the first ten years of UNIX development. Note that the goal here is to avoid swamping the reader with details; therefore, little knowledge of UNIX kernel internals is required in order to read through the material with relative ease.

Note that at this early stage, UNIX was a uniprocessor-based kernel. It would be another 10 years before mainstream multiprocessor-based UNIX versions first started to appear.

The UNIX Filesystem

Before describing how the different kernel structures work together, it is first necessary to describe how the original UNIX filesystem was stored on disk. Figure 6.1 shows the layout of various filesystem building blocks. The first (512 byte) block was unused. The second block (block 1) held the superblock, a structure that holds information about the filesystem as a whole such as the number of blocks in the filesystem, the number of inodes (files), and the number of free inodes and data blocks. Each file in the filesystem was represented by a unique inode that contained fields such as:

i_mode. This field specifies whether the file is a directory (IFDIR), a block special file (IFBLK), or a character special file (IFCHR). Note that if one of the above modes was not set, the file was assumed to be a regular file. This would later be replaced by an explicit flag, IFREG.

i_nlink. This field recorded the number of hard links to the file. When this field reaches zero, the inode is freed.

i_uid. The file's user ID.

i_gid. The file's group ID.

i_size. The file size in bytes.

i_addr. This field holds block addresses on disk where the file's data blocks are held.

i_mtime. The time the file was last modified.

i_atime. The time that the file was last accessed.

The i_addr field was an array of 8 pointers. Each pointer could reference a single disk block, giving 512 bytes of storage, or could reference what is called an indirect block. Each indirect block contained 32 pointers, each of which could point to a 512 byte block of storage or a double indirect block. Each double indirect block in turn points to data blocks. Figure 6.2 shows the two extremes whereby data blocks are accessed directly from the inode or from double indirects.

In the first example, the inode directly references two data blocks. The file size in this case will be between 513 and 1024 bytes in size. If the size of the file is less than 512 bytes, only a single data block is needed. Elements 2 to 7 of the i_addr[] array will be NULL in this case.

The second example shows the maximum possible file size. Each element of i_addr[] references an indirect block. Each indirect block points to 32 double indirect blocks, and each double indirect block points to 32 data blocks. This gives a maximum file size of 32 * 32 * 32 = 32,768 data blocks.
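The translation implied by this layout can be made concrete with a little arithmetic. The sketch below follows the fully indirect layout of the second example (512-byte blocks, 32 pointers per indirect and double indirect block) and prints which i_addr[] slot, indirect entry, and double indirect entry would be followed for a given byte offset. It illustrates the calculation only and is not kernel code:

#include <stdio.h>
#include <stdlib.h>

#define BSIZE   512     /* bytes per data block */
#define NINDIR  32      /* pointers per indirect or double indirect block */

int
main(int argc, char **argv)
{
    long offset, lbn;

    if (argc != 2) {
        fprintf(stderr, "usage: blkmap byte-offset\n");
        return 1;
    }
    offset = atol(argv[1]);
    lbn = offset / BSIZE;    /* logical block number within the file */

    /* Follow the fully indirect layout of the second example above:
       i_addr[] -> indirect block -> double indirect block -> data block */
    printf("byte offset %ld -> logical block %ld\n", offset, lbn);
    printf("i_addr[%ld] -> indirect entry %ld -> double indirect entry %ld\n",
        lbn / (NINDIR * NINDIR), (lbn / NINDIR) % NINDIR, lbn % NINDIR);
    return 0;
}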

Filesystem-Related Kernel Structures

This section describes the main structures used in the UNIX kernel that are related to file access, from the file descriptor level down to issuing read and write calls to the disk driver.

User Mode and Kernel Mode

Each UNIX process is separated both from other processes and from the kernel through hardware-protection mechanisms. Thus, one process is unable to access the address space of another and is unable to either read from or write to the kernel data structures.

When a process is running it can either be in user mode or kernel mode. When in user mode it runs on its own stack and executes instructions from the application binary or one of the libraries that it may be linked with. In order to execute a system call, the process transfers to kernel mode by issuing a special hardware instruction. When in the kernel, all arguments related to the system call are copied into the kernel's address space. Execution proceeds on a separate kernel stack. A context switch (a switch to another user process) can take place prior to returning to the user process if the timeslice of that process has been exceeded or if the process goes to sleep (for example, while waiting for an I/O operation).

The mechanisms for transferring control between user and kernel mode are dependent on the hardware architecture. Information about each process is divided between two different kernel structures. The proc structure is always present in memory, while the user structure holds information that is only needed when the process is running. Thus, when a process is not running and is eligible to be swapped out, all structures related to the process other than the proc structure may be written to the swap device. Needless to say, the proc structure must record information about where on the swap device the other process-related structures are located.

The proc structure does not record information related to file access. However, the user structure contains a number of important file-access-related fields, namely:

u_cdir. The inode of the current working directory is stored here. This is used during pathname resolution when a user specifies a relative pathname.

u_uid/u_gid. The process user ID and group ID used for permissions checking for file-access-based system calls. Similarly, u_euid and u_egid hold the effective user and group IDs.

u_ofile. This array holds the process file descriptors. This is described in more detail later.

u_arg. An array of system call arguments set up during the transition from user to kernel mode when invoking a system call.

u_base. This field holds the address of a user space buffer in which to read data from or write data to when processing a system call such as read() or write().

u_count. The number of bytes to read or write is held here. It is decremented during the I/O operation and the result can be passed back to the user.

u_offset. This field records the offset within the file for the current read or write operation.

u_error. When processing a system call, this field is set if an error is encountered. The value of u_error is then passed back to the user when the system call returns.

There are other fields which have significance to file-access-based calls. However, these fields became redundant over the years and to avoid bloating this section, they won't be described further.

Users familiar with the chroot() system call and later versions of UNIX may have been wondering why there is no u_rdir to hold the current, per-process root directory; at this stage in UNIX development, chroot() had not been implemented.

File Descriptors and the File Table

The section File Descriptors, in Chapter 2, described how file descriptors are returned from system calls such as open(). The u_ofile[] array in the user structure is indexed by the file descriptor number to locate a pointer to a file structure.

In earlier versions of UNIX, the size of the u_ofile[] array was hard coded and had NOFILE elements. Because the stdin, stdout, and stderr file descriptors occupied slots 0, 1, and 2 within the array, the first file descriptor returned in response to an open() system call would be 3. For the early versions of UNIX, NOFILE was set at 15. This would then make its way to 20 by the time that 7th Edition appeared.

The file structure contains more information about how the file was opened and where the current file pointer is positioned within the file for reading or writing. It contained the following members:

f_flag. This flag was set based on how the file was opened. If open for reading it was set to FREAD, and if open for writing it was set to FWRITE.

f_count. Each file structure had a reference count. This field is further described below.

f_inode. After a file is opened, the inode is read in from disk and stored in an in-core inode structure. This field points to the in-core inode.

f_offset. This field records the offset within the file when reading or writing. Initially it will be zero and will be incremented by each subsequent read or write or modified by lseek().

The file structure contains a reference count. Calls such as dup() result in a new file descriptor being allocated that points to the same file table entry as the original file descriptor. Before dup() returns, the f_count field is incremented.

Although gaining access to a running 5th Edition UNIX system is a little difficult 27 years after it first appeared, it is still possible to show how these concepts work in practice on more modern versions of UNIX. Take for example the following program running on Sun's Solaris version 8:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

main()
{
    int fd1, fd2;

    fd1 = open("/etc/passwd", O_RDONLY);
    fd2 = dup(fd1);
    printf("fd1 = %d, fd2 = %d\n", fd1, fd2);
    pause();
}
 
The crash program can be used to analyze various kernel structures. In this case, it is possible to run the preceding program, locate the process with crash, and then display the corresponding user and file structures.

First of all, the program is run in the background, which displays file descriptor values of 3 and 4 as expected. The crash utility is then run and the proc command is used in conjunction with grep to locate the process in question as shown here:

# ./mydup&
 [1] 1422
 fd1 = 3, fd2 = 4
 # crash
 dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
 > proc ! grep mydup
 37 s 1422 1389 1422 1389 0 46 mydup load
The process occupies slot 37 (consider this as an array of proc structures). The slot number can be passed to the user command that displays the user area corresponding to the process. Not all of the structure is shown here, although it is easy to see some relevant information about the process including the list of file descriptors. Note that file descriptor values 0, 1, and 2 all point to the same file table entry. Also, because a call was made to dup() in the program, entries 3 and 4 in the array point to the same file table entry.
> user 37
 PER PROCESS USER AREA FOR PROCESS 37
 PROCESS MISC:
 command: mydup, psargs: ./mydup
 start: Sat Jul 28 08:50:16 2001
 mem: 90, type: exec su-user
 vnode of current directory: 300019b5468
 OPEN FILES, FLAGS, AND THREAD REFCNT:
 [0]: F 30000adad68, 0, 0 [1]: F 30000adad68, 0, 0
 [2]: F 30000adad68, 0, 0 [3]: F 30000adb078, 0, 0
 [4]: F 30000adb078, 0, 0
 ...
Finally, the file command can be used to display the file table entry corresponding to these file descriptors. Note that the reference count is now 2, the offset is 0 because no data has been read and the flags hold FREAD as indicated by the read flag displayed.
> file 30000adb078
 ADDRESS RCNT TYPE/ADDR OFFSET FLAGS
 30000adb078 2 UFS /30000aafe30 0 read
With the exception that this file structure points to a vnode as opposed to the old in-core inode, the main structure has remained remarkably intact for UNIX's 30+ year history.

The Inode Cache

Each file is represented on disk by an inode. When a file is opened, the inode must be retrieved from disk. Operations such as the stat() system call retrieve much of the information they require from the inode structure.

The inode must remain in memory for the duration of the open and is typically written back to disk if any operations require changes to the inode structure. For example, consider writing 512 bytes of data at the end of the file that has an existing size of 512 bytes and therefore one block allocated (referenced by i_addr[0]). This will involve changing i_size to 1024 bytes, allocating a new block to the file, and setting i_addr[1] to point to this newly allocated block. These changes will be written back to disk.

After the file has been closed and there are no further processes holding the file open, the in-core inode can be freed.

If the inode were always freed on close, however, it would need to be read in again from disk each time the file is opened. This is very costly, especially considering that some inodes are accessed frequently such as the inodes for /, /usr, and /usr/bin. To prevent this from happening, inodes are retained in an inode cache even when the inode is no longer in use. Obviously if new inodes need to be read in from disk, these unused, cached inodes will need to be reallocated.

Figure 6.3 shows the linkage between file descriptors and inodes. The top process shows that by calling dup(), a new file descriptor is allocated resulting in fdb and fdc both pointing to the same file table entry. The file table entry then points to the inode for /etc/passwd.

For the bottom process, the open of /etc/passwd results in allocation of both a new file descriptor and file table entry. The file table entry points to the same in-core copy of the inode for this file as referenced by the top process. To handle these multiple references, the i_count field is used. Each time a file is opened, i_count is incremented and subsequently decremented on each close. Note that the inode cannot be released from the inode cache until after the last close.
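The reference counting described above can be modeled with a pair of routines named after the traditional iget() and iput(). The code below is only a toy model of the behavior, with a single cached entry, no disk I/O, and no locking; it is not the historical implementation:

#include <stdio.h>

/* Toy model of an in-core inode: just enough to show the counting */
struct incore_inode {
    int dev;
    int inum;
    int i_count;        /* number of active references (opens) */
};

static struct incore_inode cached = { 1, 2, 0 };    /* one "cached" inode */

static struct incore_inode *
iget(int dev, int inum)
{
    if (cached.dev == dev && cached.inum == inum) {
        cached.i_count++;    /* reuse the copy already in memory */
        return &cached;
    }
    return NULL;    /* a real kernel would read the inode from disk here */
}

static void
iput(struct incore_inode *ip)
{
    /*
     * When i_count drops to zero the inode is no longer in use, but it
     * stays in the cache so that a later open can find it again.
     */
    ip->i_count--;
}

int
main(void)
{
    struct incore_inode *a = iget(1, 2);    /* first open of the file */
    struct incore_inode *b = iget(1, 2);    /* second open, same file */

    printf("i_count after two opens: %d\n", a->i_count);
    iput(b);
    iput(a);
    printf("i_count after both closes: %d\n", cached.i_count);
    return 0;
}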

The Buffer Cache

Devices were and still are accessed by the device ID and block number. Device IDs are constructed from the device major number and minor number. The major number has traditionally been nothing more than an entry into an array of vectors pointing to device driver entry points. Block special files are accessed through the bdevsw[] array while character special files are accessed through the cdevsw[] array. Both arrays were traditionally hard coded into the kernel. Filesystems access the disk through the block driver interface for which the disk driver exports a strategy function that is called by the filesystem.

Each driver, through its exported strategy function, accepts a buf structure that contains all the necessary information required to perform the I/O.
The buf structure has actually changed very little over the years. Around 5th Edition it contained the following fields:

struct buf {
    int         b_flags;    /* buffer state flags (see below) */
    struct buf *b_forw;     /* device chain linkage */
    struct buf *b_back;     /* device chain linkage */
    struct buf *av_forw;    /* free list linkage */
    struct buf *av_back;    /* free list linkage */
    int         b_dev;      /* major/minor device ID */
    int         b_wcount;   /* transfer count for the I/O */
    char       *b_addr;     /* in-core buffer holding the data */
    char       *b_blkno;    /* block number on the device */
    char        b_error;    /* error returned after I/O */
    char       *b_resid;    /* amount not transferred on error */
};
The b_forw and b_back fields can be used by the device driver to chain related buffers together. After I/O is complete and the buffer is freed, the av_forw and av_back fields are used to hold the buffer on the free list. Note that buffers on the free list retain their identity until reused and thus act as a cache of recently accessed blocks. The b_dev and b_blkno fields are used to associate the buffer with a particular device and block number, while the b_addr field points to an in-core buffer that holds the data read or to be written. The b_wcount, b_error, and b_resid fields are used during I/O and will be described in the section Putting It All Together later in this chapter.

The b_flags field contains information about the state of the buffer. Some of the possible flags are shown below:

B_WRITE. A call to the driver will cause the buffer contents to be written to block b_blkno within the device specified by b_dev.

B_READ. A call to the driver will read the block specified by b_blkno and b_dev into the buffer data block referenced by b_addr.

B_DONE. I/O has completed and the data may be used.

B_ERROR. An error occurred while reading or writing.

B_BUSY. The buffer is currently in use.

B_WANTED. This field is set to indicate that another process wishes to use this buffer. After the I/O is complete and the buffer is relinquished, the kernel will wake up the waiting process.

When the kernel bootstraps, it initializes an array of NBUF buffers to comprise the buffer cache. Each buffer is linked together through the av_forw and av_back fields and headed by the bfreelist pointer.

The two main interfaces exported by the buffer cache are bread() and bwrite() for reading and writing respectively. Both function declarations are shown below:

struct buf *bread(int dev, int blkno);
void bwrite(struct buf *bp);
 
Considering bread() first, it must make a call to getblk() to search for a buffer in the cache that matches the same device ID and block number. If the buffer is not in the cache, getblk() takes the first buffer from the free list, sets its identity to that of the device (dev) and block number (blkno), and returns it.

When bread() retrieves a buffer from getblk(), it checks to see if the B_DONE flag is set. If this is the case, the buffer contents are valid and the buffer can be returned. If B_DONE is not set, the block must be read from disk. In this case a call is made to the disk driver strategy routine followed by a call to iowait() to sleep until the data has been read.
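Expressed as code, this logic looks roughly as follows. This is a sketch rather than the original source: the buffer fields, getblk(), and iowait() are named in the text, but the way the driver strategy routine is reached through bdevsw[] (the d_strategy member and the major() macro) is an assumption based on early UNIX conventions.

 struct buf *
 bread(int dev, int blkno)
 {
     struct buf *bp;

     bp = getblk(dev, blkno);        /* cached buffer, or one taken from the free list */
     if (bp->b_flags & B_DONE)       /* contents already valid */
         return bp;

     bp->b_flags |= B_READ;          /* ask the driver to fill b_addr from b_blkno */
     (*bdevsw[major(dev)].d_strategy)(bp);   /* queue the I/O; returns immediately */
     iowait(bp);                     /* sleep until the I/O completes (B_DONE set) */
     return bp;
 }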

One final point worthy of mention at this stage is that the driver strategy interface is asynchronous. After the I/O has been queued, the device driver returns. Performing I/O is a time-consuming operation, so the rest of the system could be doing something else while the I/O is in progress. In the case shown above, a call is made to iowait(), which causes the current process to sleep until the I/O is complete. The asynchronous nature of the strategy function allowed read ahead to be implemented whereby the kernel could start an asynchronous read of the next block of the file so that the data may already be in memory when the process requests it. The data requested is read, but before returning to the user with the data, a strategy call is made to read the next block without a subsequent call to iowait().

To perform a write, a call is made to bwrite(), which simply needs to invoke the two line sequence previously shown.

After the caller has finished with the buffer, a call is made to brelse(), which takes the buffer and places it at the back of the freelist. This ensures that the oldest free buffer will be reassigned first.

Mounting Filesystems

The section The UNIX Filesystem, earlier in this chapter, showed how filesystems were laid out on disk with the superblock occupying block 1 of the disk slice. Mounted filesystems were held in a linked list of mount structures, one per filesystem with a maximum of NMOUNT mounted filesystems. Each mount structure has three elements, namely:

m_dev. This field holds the device ID of the disk slice and can be used in a simple check to prevent a second mount of the same filesystem.

m_buf. This field points to the superblock (struct filsys), which is read from disk during a mount operation.

m_inodp. This field references the inode for the directory onto which this filesystem is mounted. This is further explained in the section Pathname Resolution later in this chapter.

The root filesystem is mounted early on during kernel initialization. This involved a very simple code sequence that relied on the root device being hard coded into the kernel. The block containing the superblock of the root filesystem is read into memory by calling bread(); then the first mount structure is initialized to point to the buffer.

Any subsequent mounts needed to come in through the mount() system call. The first task to perform would be to walk through the list of existing mount structures checking m_dev against the device passed to mount(). If the filesystem is mounted already, EBUSY is returned; otherwise another mount structure is allocated for the new mounted filesystem.
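A sketch of this duplicate-mount check is shown below. The mount structure fields and NMOUNT come from the text; the function name, the error returned when the table is full, and the style of returning errors directly (rather than setting u_error) are simplifications for illustration.

 struct mount {
     int           m_dev;     /* device ID of the mounted disk slice */
     struct buf   *m_buf;     /* buffer holding the superblock (struct filsys) */
     struct inode *m_inodp;   /* inode of the directory mounted on */
 };

 struct mount mount[NMOUNT];

 int
 do_mount(int dev, struct inode *dirp)
 {
     struct mount *mp, *freemp = NULL;

     for (mp = &mount[0]; mp < &mount[NMOUNT]; mp++) {
         if (mp->m_buf != NULL && mp->m_dev == dev)
             return EBUSY;            /* filesystem already mounted */
         if (mp->m_buf == NULL && freemp == NULL)
             freemp = mp;             /* remember the first free slot */
     }
     if (freemp == NULL)
         return ENOSPC;               /* no free mount structure (illustrative) */

     freemp->m_dev   = dev;
     freemp->m_buf   = bread(dev, 1); /* superblock occupies block 1 */
     freemp->m_inodp = dirp;
     return 0;
 }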

System Call Handling

Arguments passed to system calls are placed on the user stack prior to invoking a hardware instruction that then transfers the calling process from user mode to kernel mode. Once inside the kernel, any system call handler needs to be able to access the arguments. Because the process may sleep awaiting some resource, resulting in a context switch, the kernel needs to copy these arguments into the kernel address space.

The sysent[] array specifies all of the system calls available, including the number of arguments.

By executing a hardware trap instruction, control is passed from user space to the kernel and the kernel trap() function runs to determine the system call to be processed. The C library function linked with the user program stores a unique value on the user stack corresponding to the system call. The kernel uses this value to locate the entry in sysent[] to understand how many arguments are being passed.

For a read() or write() system call, the arguments are accessible as follows:

fd = u.u_ar0[R0]
 u_base = u.u_arg[0]
 u_count = u.u_arg[1]
This is a little strange because the first and subsequent arguments are accessed in a different manner. This is partly due to the hardware on which 5th Edition UNIX was based and partly due to the method that the original authors chose to handle traps.

If any error is detected during system call handling, u_error is set to record the error found. For example, if an attempt is made to mount an already mounted filesystem, the mount system call handler will set u_error to EBUSY. As part of completing the system call, trap() will set up the r0 register to contain the error code, which is then accessible as the return value of the system call once control is passed back to user space.

For further details on system call handling in early versions of UNIX, [LION96] should be consulted. Steve Pate's book UNIX Internals: A Practical Approach [PATE96] describes in detail how system calls are implemented at an assembly language level in System V Release 3 on the Intel x86 architecture.

Pathname Resolution

System calls often specify a pathname that must be resolved to an inode before the system call can continue. For example, in response to:

fd = open("/etc/passwd", O_RDONLY);

the kernel must ensure that /etc is a directory and that passwd is a file within the /etc directory.

Where to start the search depends on whether the pathname specified is absolute or relative. If it is an absolute pathname, the search starts from rootdir, a pointer to the root inode in the root filesystem that is initialized during kernel bootstrap. If the pathname is relative, the search starts from u_cdir, the inode of the current working directory. Thus, one can see that changing a directory involves resolving a pathname to a base directory component and then setting u_cdir to reference the inode for that directory.

The routine that performs pathname resolution is called namei(). It uses fields in the user area as do many other kernel functions. Much of the work of namei() involves parsing the pathname to be able to work on one component at a time. Consider, at a high level, the sequence of events that must take place to resolve /etc/passwd.

if (absolute pathname) {
 dip = rootdir
 } else {
 dip = u.u_cdir
 }
 loop:
 name = next component
 scan dip for name / inode number
 iput(dip)
 dip = iget() to read in inode
 if last component {
 return dip
 } else {
 goto loop
 }
This is an oversimplification but it illustrates the steps that must be performed. The routines iget() and iput() are responsible for retrieving an inode and releasing an inode respectively. A call to iget() scans the inode cache before reading the inode from disk. Either way, the returned inode will have its hold count (i_count) increased. A call to iput() decrements i_count and, if it reaches 0, the inode can be placed on the free list.

To facilitate crossing mount points, fields in the mount and inode structures are used. The m_inodp field of the mount structure points to the directory inode on which the filesystem is mounted, allowing the kernel to perform a ".." traversal over a mount point. The inode that is mounted on has the IMOUNT flag set, which allows the kernel to go over a mount point.

Putting It All Together

In order to describe how all of the above subsystems work together, this section will follow a call to open() on /etc/passwd followed by the read() and close() system calls.

Figure 6.4 shows the main structures involved in actually performing the read. It is useful to have this figure in mind while reading through the following sections.

Opening a File

The open() system call is handled by the open() kernel function. Its first task is to call namei() to resolve the pathname passed to open(). Assuming the pathname is valid, the inode for passwd is returned. A call to open1() is then made passing the open mode. The split between open() and open1() allows the open() and creat() system calls to share much of the same code.

First of all, open1() must call access() to ensure that the process can access the file according to ownership and the mode passed to open(). If all is fine, a call to falloc() is made to allocate a file table entry. Internally this invokes ufalloc() to allocate a file descriptor from u_ofile[]. The newly allocated file descriptor will be set to point to the newly allocated file table entry. Before returning from open1(), the linkage between the file table entry and the inode for passwd is established as was shown in Figure 6.3.

Reading the File

The read() and write() system calls are handled by kernel functions of the same name. Both make a call to rdwr() passing FREAD or FWRITE. The role of rdwr() is fairly straightforward in that it sets up the appropriate fields in the user area to correspond to the arguments passed to the system call and invokes either readi() or writei() to read from or write to the file. The following pseudo code shows the steps taken for this initialization. Note that some of the error checking has been removed to simplify the steps taken.
get file pointer from user area
 set u_base to u.u_arg[0]; /* user supplied buffer */
 set u_count to u.u_arg[1]; /* number of bytes to read/write */
 if (reading) {
 readi(fp->f_inode);
 } else {
 writei(fp->f_inode);
 }
The internals of readi() are fairly straightforward and involve making repeated calls to bmap() to obtain the disk block address from the file offset. The bmap() function takes a logical block number within the file and returns the physical block number on disk. This is used as an argument to bread(), which reads in the appropriate block from disk. The uiomove() function then transfers data to the buffer specified in the call to read(), which is held in u_base. This also increments u_base and decrements u_count so that the loop will terminate after all the data has been transferred.
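The loop inside readi() can be sketched as follows. The names bmap(), bread(), brelse(), uiomove(), u_base, and u_count are those used in the text; the block size constant, the min() helper, and the exact offset arithmetic are illustrative assumptions, and error and end-of-file handling are omitted.

 void
 readi(struct inode *ip)
 {
     struct buf *bp;
     int blk, off, n;

     while (u.u_count > 0 && u.u_offset < ip->i_size) {
         blk = bmap(ip, u.u_offset / BSIZE);  /* logical file block -> disk block */
         off = u.u_offset % BSIZE;            /* offset within that block */
         n   = min(BSIZE - off, u.u_count);   /* bytes to copy on this pass */

         bp = bread(ip->i_dev, blk);          /* read the block (or find it cached) */
         uiomove(bp->b_addr + off, n);        /* copy to u_base, adjusting u_base/u_count */
         brelse(bp);                          /* return the buffer to the free list */
     }
 }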

If any errors are encountered during the actual I/O, the b_flags field of the buf structure will be set to B_ERROR and additional error information may be stored in b_error. In response to an I/O error, the u_error field of the user structure will be set to either EIO or ENXIO.

The b_resid field is used to record how many bytes out of a request size of u_count were not transferred. Both fields are used to notify the calling process of how many bytes were actually read or written.

Closing the File

The close() system call is handled by the close() kernel function. It performs little work other than obtaining the file table entry by calling getf(), zeroing the appropriate entry in u_ofile[], and then calling closef(). Note that because a previous call to dup() may have been made, the reference count of the file table entry must be checked before it can be freed. If the reference count (f_count) is 1, the entry can be removed and a call to closei() is made to free the inode. If the value of f_count is greater than 1, it is decremented and the work of close() is complete.

To release a hold on an inode, iput() is invoked. The additional work performed by closei() allows a device driver close call to be made if the file to be closed is a device.

As with closef(), iput() checks the reference count of the inode (i_count). If it is greater than 1, it is decremented, and there is no further work to do. If the count has reached 1, this is the only hold on the file so the inode can be released. One additional check that is made is to see if the hard link count of the inode has reached 0. This implies that an unlink() system call was invoked while the file was still open. If this is the case, the inode can be freed on disk.
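The reference-count handling in iput() can be summarized by the following sketch. The i_count and i_nlink checks follow the description above; itrunc() and ifree(), which release the file's data blocks and its on-disk inode, are the traditional helper names and are assumptions here.

 void
 iput(struct inode *ip)
 {
     if (ip->i_count > 1) {               /* other holds remain */
         ip->i_count--;
         return;
     }
     if (ip->i_nlink == 0) {              /* unlinked while still open */
         itrunc(ip);                      /* free the data blocks (assumed helper) */
         ifree(ip->i_dev, ip->i_number);  /* free the on-disk inode (assumed helper) */
     }
     ip->i_count = 0;                     /* last hold gone: inode stays in the cache or is reused */
 }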

Summary

This chapter concentrated on the structures introduced in the early UNIX versions, which should provide readers with a basic grounding in UNIX kernel principles, particularly as they apply to how filesystems and files are accessed. It says something for the design of the original versions of UNIX that many UNIX-based kernels still bear a great deal of similarity to the original versions developed over 30 years ago. Lions' book Lions' Commentary on UNIX 6th Edition [LION96] provides a unique view of how 6th Edition UNIX was implemented and lists the complete kernel source code. For additional browsing, the source code is available online for download. For a more concrete explanation of some of the algorithms and more details on the kernel in general, Bach's book The Design of the UNIX Operating System [BACH86] provides an excellent overview of System V Release 2. Pate's book UNIX Internals: A Practical Approach [PATE96] describes a System V Release 3 variant. The UNIX versions described in both books bear most resemblance to the earlier UNIX research editions.

CHAPTER 7 Development of the SVR4 VFS/Vnode Architecture

The development of the File System Switch (FSS) architecture in SVR3, the Sun VFS/vnode architecture in SunOS, and then the merge between the two to produce SVR4, substantially changed the way that filesystems were accessed and implemented. During this period, the number of filesystem types increased dramatically, including the introduction of commercial filesystems such as VxFS that allowed UNIX to move toward the enterprise computing market.

SVR4 also introduced a number of other important concepts pertinent to filesystems, such as tying filesystem access to memory-mapped files, the DNLC (Directory Name Lookup Cache), and a separation between the traditional buffer cache and the page cache, which also changed the way that I/O was performed.

This chapter follows the developments that led up to the implementation of SVR4, which is still the basis of Sun's Solaris operating system and also freely available under the auspices of Caldera's OpenUNIX.

The Need for Change

The research editions of UNIX had a single filesystem type, as described in Chapter 6. The tight coupling between the kernel and the filesystem worked well at this stage because there was only one filesystem type and the kernel was single threaded, meaning that only one process could be running in the kernel at the same time.

Before long, the need to add new filesystem types, including non-UNIX filesystems, resulted in a shift away from the old style filesystem implementation to a newer, cleaner architecture that clearly separated the different physical filesystem implementations from those parts of the kernel that dealt with file and filesystem access.

Pre-SVR3 Kernels

With the exception of Lions' book on 6th Edition UNIX [LION96], no other UNIX kernels were documented in any detail until the arrival of System V Release 2, which was the basis for Bach's book The Design of the UNIX Operating System [BACH86]. In his book, Bach describes the on-disk layout to be almost identical to that of the earlier versions of UNIX.

There was little change between the research editions of UNIX and SVR2 to warrant describing the SVR2 filesystem architecture in detail. Around this time, most of the work on filesystem evolution was taking place at the University of California at Berkeley to produce the BSD Fast File System, which would, in time, become UFS.

The File System Switch

Introduced with System V Release 3.0, the File System Switch (FSS) architecture introduced a framework under which multiple different filesystem types could coexist in parallel.

The FSS was poorly documented and the source code for SVR3-based derivatives is not publicly available. [PATE96] describes in detail how the FSS was implemented. Note that the version of SVR3 described in that book contained a significant number of kernel changes (made by SCO) and therefore differed substantially from the original SVR3 implementation. This section highlights the main features of the FSS architecture.

As with earlier UNIX versions, SVR3 kept the mapping from file descriptors in the user area, through the file table, to in-core inodes. One of the main goals of SVR3 was to provide a framework under which multiple different filesystem types could coexist at the same time. Thus, each time a call was made to mount, the caller could specify the filesystem type. Because the FSS could support multiple different filesystem types, the traditional UNIX filesystem needed to be named so it could be identified when calling the mount command. Thus, it became known as the s5 (System V) filesystem. Throughout the USL-based development of System V through to the various SVR4 derivatives, little development would occur on s5. SCO completely restructured their s5-based filesystem over the years and added a number of new features.

The boundary between the filesystem-independent layer of the kernel and the filesystem-dependent layer occurred mainly through a new implementation of the in-core inode. Each filesystem type could potentially have a very different on-disk representation of a file. Newer diskless filesystems such as NFS and RFS had different, non-disk-based structures once again. Thus, the new inode contained fields that were generic to all filesystem types such as user and group IDs and file size, as well as the ability to reference data that was filesystem-specific. Additional fields used to construct the FSS interface were:

i_fsptr. This field points to data that is private to the filesystem and that is not visible to the rest of the kernel. For disk-based filesystems this field would typically point to a copy of the disk inode.

i_fstyp. This field identifies the filesystem type.

i_mntdev. This field points to the mount structure of the filesystem to which this inode belongs.

i_mton. This field is used during pathname traversal. If the directory referenced by this inode is mounted on, this field points to the mount structure for the filesystem that covers this directory.

i_fstypp. This field points to a vector of filesystem functions that are called by the filesystem-independent layer.

The set of filesystem-specific operations is defined by the fstypsw structure. An array of the same name holds an fstypsw structure for each possible filesystem. The elements of the structure, and thus the functions that the kernel can call into the filesystem with, are shown in Table 7.1.

When a file is opened for access, the i_fstypp field is set to point to the fstypsw[] entry for that filesystem type. In order to invoke a filesystem-specific function, the kernel performs a level of indirection through a macro that accesses the appropriate function. For example, consider the definition of FS_READI() that is invoked to read data from a file:

#define FS_READI(ip) (*fstypsw[(ip)->i_fstyp].fs_readi)(ip)

All filesystems must follow the same calling conventions such that they all understand how arguments will be passed. In the case of FS_READI(), the arguments of interest will be held in u_base and u_count. Before returning to the filesystem-independent layer, u_error will be set to indicate whether an error occurred and u_resid will contain a count of any bytes that could not be read or written.
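The following sketch shows how this indirection fits together. Only the fstypsw[] array, the fs_readi member, and the FS_READI() macro are taken from the text; the other members and the s5/NFS function names are illustrative assumptions.

 struct fstypsw {
     int (*fs_readi)(struct inode *ip);
     int (*fs_writei)(struct inode *ip);
     /* ...one member per operation listed in Table 7.1... */
 };

 struct fstypsw fstypsw[] = {
     { s5_readi,  s5_writei  /* , ... */ },   /* the s5 filesystem */
     { nfs_readi, nfs_writei /* , ... */ },   /* the NFS client */
 };

 #define FS_READI(ip) (*fstypsw[(ip)->i_fstyp].fs_readi)(ip)

 /* Filesystem-independent code simply calls FS_READI(ip); the inode's
    i_fstyp field selects the correct filesystem implementation. */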

Mounting Filesystems

The method of mounting filesystems in SVR3 changed because each filesystem's superblock could be different and, in the case of NFS and RFS, there was no superblock per se. The list of mounted filesystems was moved into an array of mount structures that contained the following elements:

m_flags. Because this is an array of mount structures, this field was used to indicate which elements were in use. For filesystems that were mounted, m_flags indicates whether the filesystem was also mounted read-only.

m_fstyp. This field specified the filesystem type.

m_bsize. The logical block size of the filesystem is held here. Each filesystem could typically support multiple different block sizes as the unit of allocation to a file.

m_dev. The device on which the filesystem resides.

m_bufp. A pointer to a buffer containing the superblock.

m_inodp. With the exception of the root filesystem, this field points to the inode on which the filesystem is mounted. This is used during pathname traversal.

m_mountp. This field points to the root inode for this filesystem.

m_name. The file system name.

Figure 7.1 shows the main structures used in the FSS architecture. There are a number of observations worthy of mention:

The structures shown are independent of filesystem type. The mount and inode structures abstract information about the filesystems and files that they represent in a generic manner. Only when operations go through the FSS do they become filesystem-dependent. This separation allows the FSS to support very different filesystem types, from the traditional s5 filesystem to DOS to diskless filesystems such as NFS and RFS.

Although not shown here, the mapping between file descriptors, the user area, the file table, and the inode cache remained as is from earlier versions of UNIX.

The Virtual Memory (VM) subsystem makes calls through the FSS to obtain a block map for executable files. This is to support demand paging. When a process runs, the pages of the program text are faulted in from the executable file as needed. The VM makes a call to FS_ALLOCMAP() to obtain this mapping. Following this call, it can invoke the FS_READMAP() function to read the data from the file when handling a page fault.

There is no clean separation between file-based and filesystem-based operations. All functions exported by the filesystem are held in the same fstypsw structure.

The FSS was a big step away from the traditional single filesystem-based UNIX kernel. With the exception of SCO, which retained an SVR3-based kernel for many years after the introduction of SVR3, the FSS was short lived, being replaced by the better Sun VFS/vnode interface introduced in SVR4.

The Sun VFS/Vnode Architecture

Developed on Sun Microsystems' SunOS operating system, the world first came to know about vnodes through Steve Kleiman's often-quoted Usenix paper "Vnodes: An Architecture for Multiple File System Types in Sun UNIX" [KLEI86]. The paper stated four design goals for the new filesystem architecture:

The filesystem implementation should be clearly split into a filesystem independent and filesystem-dependent layer. The interface between the two should be well defined.

It should support local disk filesystems such as the 4.2BSD Fast File System (FFS), non-UNIX-like filesystems such as MS-DOS, stateless filesystems such as NFS, and stateful filesystems such as RFS.

It should be able to support the server side of remote filesystems such as NFS and RFS.

Filesystem operations across the interface should be atomic such that several operations do not need to be encompassed by locks.

One of the major implementation goals was to remove the need for global data, allowing the interfaces to be re-entrant. Thus, the previous style of storing filesystem-related data in the user area, such as u_base and u_count, needed to be removed. The setting of u_error on error also needed to be removed; the new interfaces should explicitly return an error value.

The main components of the Sun VFS architecture are shown in Figure 7.2. These components will be described throughout the following sections.

The architecture actually has two sets of interfaces between the filesystem-independent and filesystem-dependent layers of the kernel. The VFS interface was accessed through a set of vfsops while the vnode interface was accessed through a set of vnops (also called vnodeops). The vfsops operate on a filesystem while vnodeops operate on individual files.

Because the architecture encompassed non-UNIX and non-disk-based filesystems, the in-core inode that had been prevalent as the memory-based representation of a file over the previous 15 years was no longer adequate. A new type, the vnode, was introduced. This simple structure contained all that was needed by the filesystem-independent layer while allowing individual filesystems to hold a reference to a private data structure; in the case of the disk-based filesystems this may be an inode, for NFS, an rnode, and so on.

The fields of the vnode structure were:

v_flag. The VROOT flag indicates that the vnode is the root directory of a filesystem, VNOMAP indicates that the file cannot be memory mapped, VNOSWAP indicates that the file cannot be used as a swap device, VNOMOUNT indicates that the file cannot be mounted on, and VISSWAP indicates that the file is part of a virtual swap device.

v_count. Similar to the old i_count inode field, this field is a reference count corresponding to the number of open references to the file.

v_shlockc. This field counts the number of shared locks on the vnode.

v_exlockc. This field counts the number of exclusive locks on the vnode.

v_vfsmountedhere. If a filesystem is mounted on the directory referenced by this vnode, this field points to the vfs structure of the mounted filesystem. This field is used during pathname traversal to cross filesystem mount points.

v_op. The vnode operations associated with this file type are referenced through this pointer.

v_vfsp. This field points to the vfs structure for this filesystem.

v_type. This field specifies the type of file that the vnode represents. It can be set to VREG (regular file), VDIR (directory), VBLK (block special file), VCHR (character special file), VLNK (symbolic link), VFIFO (named pipe), or VXNAM (Xenix special file).

v_data. This field can be used by the filesystem to reference private data such as a copy of the on-disk inode.
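Collected into a single declaration, the fields listed above give a structure along the following lines; the member names match the text, but the types and their ordering are approximations rather than the SunOS source.

 struct vnode {
     unsigned short   v_flag;            /* VROOT, VNOMAP, VNOSWAP, VNOMOUNT, VISSWAP */
     unsigned short   v_count;           /* reference count of open references */
     unsigned short   v_shlockc;         /* count of shared locks */
     unsigned short   v_exlockc;         /* count of exclusive locks */
     struct vfs      *v_vfsmountedhere;  /* vfs of a filesystem mounted on this directory */
     struct vnodeops *v_op;              /* vnode operations vector */
     struct vfs      *v_vfsp;            /* vfs to which this vnode belongs */
     enum vtype       v_type;            /* VREG, VDIR, VBLK, VCHR, VLNK, VFIFO, VXNAM */
     caddr_t          v_data;            /* filesystem-private data, e.g. the inode */
 };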

There is nothing in the vnode that is UNIX specific or even pertains to a local filesystem. Of course not all filesystems support all UNIX file types. For example, the DOS filesystem doesn't support symbolic links. However, filesystems in the VFS/vnode architecture are not required to support all vnode operations. For those operations not supported, the appropriate field of the vnodeops vector will be set to fs_nosys, which simply returns ENOSYS.

The uio Structure

One way of meeting the goals of avoiding user area references was to package all I/O-related information into a uio structure that would be passed across the vnode interface. This structure contained the following elements:

uio_iov. A pointer to an array of iovec structures each specifying a base user address and a byte count.

uio_iovcnt. The number of iovec structures.

uio_offset. The offset within the file that the read or write will start from.

uio_segflg. This field indicates whether the request is from a user process (user space) or a kernel subsystem (kernel space). This field is required by the kernel copy routines.

uio_resid. The residual count following the I/O.
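The structure implied by this list is roughly the following; member names follow the text while the types are approximations.

 struct iovec {
     caddr_t iov_base;           /* buffer address */
     int     iov_len;            /* number of bytes at iov_base */
 };

 struct uio {
     struct iovec *uio_iov;      /* array of iovec structures */
     int           uio_iovcnt;   /* number of iovec structures */
     off_t         uio_offset;   /* file offset at which the I/O starts */
     int           uio_segflg;   /* user space or kernel space request */
     int           uio_resid;    /* residual byte count after the I/O */
 };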

Because the kernel was now supporting filesystems such as NFS, for which requests come over the network into the kernel, the need to remove user area access was imperative. By creating a uio structure, it is easy for NFS to then make a call to the underlying filesystem.

The uio structure also provides the means by which the readv() and writev() system calls can be implemented. Instead of making multiple calls into the filesystem for each I/O, several iovec structures can be passed in at the same time.

The VFS Layer

The list of mounted filesystems is maintained as a linked list of vfs structures. As with the vnode structure, this structure must be filesystem independent. The vfs_data field can be used to point to any filesystem-dependent data structure, for example, the superblock.

Similar to the File System Switch method of using macros to access filesystem-specific operations, the vfsops layer utilizes a similar approach. Each filesystem provides a vfsops structure that contains a list of functions applicable to the filesystem. This structure can be accessed from the vfs_op field of the vfs structure. The set of operations available is:

vfs_mount. The filesystem type is passed to the mount command using the -F option. This is then passed through the mount() system call and is used to locate the vfsops structure for the filesystem in question. This function can be called to mount the filesystem.

vfs_unmount. This function is called to unmount a filesystem.

vfs_root. This function returns the root vnode for this filesystem and is called during pathname resolution.

vfs_statfs. This function returns filesystem-specific information in response to the statfs() system call. This is used by commands such as df.

vfs_sync. This function flushes file data and filesystem structural data to disk, which provides a level of filesystem hardening by minimizing data loss in the event of a system crash.

vfs_fid. This function is used by NFS to construct a file handle for a specified vnode.

vfs_vget. This function is used by NFS to convert a file handle returned by a previous call to vfs_fid into a vnode on which further operations can be performed.
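Gathered together, these operations form a vector along the following lines; the argument lists shown here are simplified and should not be read as the exact SunOS prototypes.

 struct vfsops {
     int (*vfs_mount)(struct vfs *vfsp, char *path, caddr_t data);
     int (*vfs_unmount)(struct vfs *vfsp);
     int (*vfs_root)(struct vfs *vfsp, struct vnode **vpp);
     int (*vfs_statfs)(struct vfs *vfsp, struct statfs *sbp);
     int (*vfs_sync)(struct vfs *vfsp);
     int (*vfs_fid)(struct vnode *vp, struct fid **fidpp);
     int (*vfs_vget)(struct vfs *vfsp, struct vnode **vpp, struct fid *fidp);
 };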

The Vnode Operations Layer

All operations that can be applied to a file are held in the vnode operations vector defined by the vnodeops structure. The functions from this vector follow:

vop_open. This function is only applicable to device special files, files in the namespace that represent hardware devices. It is called once the vnode has been returned from a prior call to vop_lookup.

vop_close. This function is only applicable to device special files. It is called once the vnode has been returned from a prior call to vop_lookup.

vop_rdwr. Called to read from or write to a file. The information about the I/O is passed through the uio structure.

vop_ioctl. This call invokes an ioctl on the file, a function that can be passed to device drivers.

vop_select. This vnodeop implements select().

vop_getattr. Called in response to system calls such as stat(), this vnodeop fills in a vattr structure, which can be returned to the caller via the stat structure.

vop_setattr. Also using the vattr structure, this vnodeop allows the caller to set various file attributes such as the file size, mode, user ID, group ID, and file times.

vop_access. This vnodeop allows the caller to check the file for read, write, and execute permissions. A cred structure that is passed to this function holds the credentials of the caller.

vop_lookup. This function replaces part of the old namei() implementation. It takes a directory vnode and a component name and returns the vnode for the component within the directory.

vop_create. This function creates a new file in the specified directory vnode. The file properties are passed in a vattr structure.

vop_remove. This function removes a directory entry.

vop_link. This function implements the link() system call.

vop_rename. This function implements the rename() system call.

vop_mkdir. This function implements the mkdir() system call.

vop_rmdir. This function implements the rmdir() system call.

vop_readdir. This function reads directory entries from the specified directory vnode. It is called in response to the getdents() system call.

vop_symlink. This function implements the symlink() system call.

vop_readlink. This function reads the contents of the symbolic link.

vop_fsync. This function flushes any modified file data in memory to disk. It is called in response to an fsync() system call.

vop_inactive. This function is called when the filesystem-independent layer of the kernel releases its last hold on the vnode. The filesystem can then free the vnode.

vop_bmap. This function is used for demand paging so that the virtual memory (VM) subsystem can map logical file offsets to physical disk offsets.

vop_strategy. This vnodeop is used by the VM and buffer cache layers to read blocks of a file into memory following a previous call to vop_bmap().

vop_bread. This function reads a logical block from the specified vnode and returns a buffer from the buffer cache that references the data.

vop_brelse. This function releases the buffer returned by a previous call to vop_bread.

If a filesystem does not support some of these interfaces, the appropriate entry in the vnodeops vector should be set to fs_nosys(), which, when called, will return ENOSYS. The set of vnode operations is accessed through the v_op field of the vnode using macros, as the following definition shows:

#define VOP_INACTIVE(vp, cr) \
 (*(vp)->v_op->vop_inactive)(vp, cr)
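As an illustration, a hypothetical filesystem ("myfs", not a real implementation) would populate its vnodeops vector along the lines sketched below, pointing any unsupported operation at fs_nosys(); the ordering is abbreviated.

 struct vnodeops myfs_vnodeops = {
     myfs_open,
     myfs_close,
     myfs_rdwr,
     fs_nosys,        /* vop_ioctl not supported */
     fs_nosys,        /* vop_select not supported */
     myfs_getattr,
     myfs_setattr,
     /* ...remaining operations... */
     myfs_inactive,
 };

 /* The filesystem-independent layer never calls these functions directly; it
    goes through the macros, for example VOP_INACTIVE(vp, cr), which indirect
    through vp->v_op as shown above. */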

Pathname Traversal

Pathname traversal differs from the File System Switch method due to differences in the structures and operations provided at the VFS layer. Consider the example shown in Figure 7.3 and consider the following two scenarios:

1. A user types "cd /mnt" to move into the mnt directory.

2. A user is in the directory /mnt and types "cd .." to move up one level.

In the first case, the pathname is absolute, so a search will start from the root directory vnode. This is obtained by following rootvfs to the first vfs structure and invoking the vfs_root function. This returns the root vnode for the root filesystem (this is typically cached to avoid repeating this set of steps). A scan is then made of the root directory to locate the mnt directory. Because the v_vfsmountedhere field is set, the kernel follows this link to locate the vfs structure for the mounted filesystem through which it invokes the vfs_root function for that filesystem. Pathname traversal is now complete so the u_cdir field of the user area is set to point to the vnode for /mnt to be used in subsequent pathname operations.
In the second case, the user is already in the root directory of the filesystem mounted on /mnt (the v_flag field of the vnode is set to VROOT). The kernel locates the mounted-on vnode through the vfs_vnodecovered field. Because this directory (/mnt in the root directory) is not currently visible to users (it is hidden by the mounted filesystem), the kernel must then move up a level to the root directory. This is achieved by obtaining the vnode referenced by ".." in the /mnt directory of the root filesystem. Once again, the u_cdir field of the user area will be updated to reflect the new current working directory.

The Veneer Layer

To provide more coherent access to files through the vnode interface, the implementation provided a number of functions that other parts of the kernel could invoke. The set of functions is:

vn_open. Open a file based on its file name, performing appropriate ...

The Sun VFS/vnode interface was a huge success. Its merger with the File System Switch and the SunOS virtual memory subsystem provided the basis for the SVR4 VFS/vnode architecture. There were a large number of other UNIX vendors who implemented the Sun VFS/vnode architecture. With the exception of the read and write paths, the different implementations were remarkably similar to the original Sun VFS/vnode implementation.

The SVR4 VFS/Vnode Architecture

System V Release 4 was the result of a merge between SVR3 and Sun Microsystems' SunOS. One of the goals of both Sun and AT&T was to merge the Sun VFS/vnode interface with AT&T's File System Switch.

The new VFS architecture, which has remained largely unchanged for over 15 years, introduced and brought together a number of new ideas, and provided a clean separation between different subsystems in the kernel. One of the fundamental changes was the tight coupling between the filesystem and the VM subsystem which, although elegant in design, was particularly complicated, resulting in a great deal of difficulty when implementing new filesystem types.

Changes to File Descriptor Management

A file descriptor had previously been an index into the u_ofile[] array. Because this array was of fixed size, the number of files that a process could have open was bound by the size of the array. Because most processes do not open a lot of files, simply increasing the size of the array is a waste of space, given the large number of processes that may be present on the system.

With the introduction of SVR4, file descriptors were allocated dynamically up to a fixed but tunable limit. The u_ofile[] array was removed and replaced by two new fields, u_nofiles, which specified the number of file descriptors that the process can currently access, and u_flist, a structure of type ufchunk that contains an array of NFPCHUNK (which is 24) pointers to file table entries. After all entries have been used, a new ufchunk structure is allocated, as shown in Figure 7.4.

The uf_pofile[] array holds file descriptor flags as set by invoking the fcntl() system call.

The maximum number of file descriptors is constrained by a per-process limit defined by the rlimit structure in the user area.

There are a number of per-process limits within the u_rlimit[] array. The u_rlimit[RLIMIT_NOFILE] entry defines both a soft and hard file descriptor limit. Allocation of file descriptors will fail once the soft limit is reached. The setrlimit() system call can be invoked to increase the soft limit up to that of the hard limit, but not beyond. The hard limit can be raised, but only by root.
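The RLIMIT_NOFILE limits can be inspected and adjusted from user space with the standard getrlimit() and setrlimit() calls, as in the following small program; only raising the hard limit itself requires root privileges.

 #include <stdio.h>
 #include <sys/resource.h>

 int
 main(void)
 {
     struct rlimit rl;

     getrlimit(RLIMIT_NOFILE, &rl);
     printf("soft = %lld, hard = %lld\n",
            (long long)rl.rlim_cur, (long long)rl.rlim_max);

     rl.rlim_cur = rl.rlim_max;            /* raise the soft limit up to the hard limit */
     if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
         perror("setrlimit");
     return 0;
 }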

The Virtual Filesystem Switch Table

Built dynamically during kernel compilation, the virtual file system switch table, underpinned by the vfssw[] array, contains an entry for each filesystem that can reside in the kernel. Each entry in the array is defined by a vfssw structure as shown below:
struct vfssw {
    char          *vsw_name;
    int           (*vsw_init)();
    struct vfsops *vsw_vfsops;
};
The vsw_name is the name of the filesystem (as passed to mount -F). The vsw_init() function is called during kernel initialization, allowing the filesystem to perform any initialization it may require before a first call to mount().

Operations that are applicable to the filesystem as opposed to individual files are held in both the vsw_vfsops field of the vfssw structure and subsequently in the vfs_ops field of the vfs structure.
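A hypothetical vfssw[] entry for a filesystem such as VxFS might therefore look as follows; the structure is the one shown above, while the vxfs function names are purely illustrative.

 struct vfssw vfssw[] = {
     /* ... */
     { "vxfs", vxfs_init, &vxfs_vfsops },
     /* ... */
 };

 /* "mount -F vxfs ..." causes the mount() system call to search vfssw[] for an
    entry whose vsw_name is "vxfs" and then to call through
    vsw_vfsops->vfs_mount() to mount the filesystem. */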

The operations exported through the vfsops vector are shown below:

vfs_mount. This function is called to mount a filesystem.

vfs_unmount. This function is called to unmount a filesystem.

vfs_root. This function returns the root vnode for the filesystem. This is used during pathname traversal.
vfs_statvfs. This function is called to obtain per-filesystem-related statistics. The df command will invoke the statvfs() system call on filesystems it wishes to report information about. Within the kernel, statvfs() is implemented by invoking the statvfs vfsop.

vfs_sync. There are two methods of syncing data to the filesystem in SVR4, namely a call to the sync command and internal kernel calls invoked by the fsflush kernel thread. The aim behind fsflush invoking vfs_sync is to flush any modified file data to disk on a periodic basis in a similar way to which the bdflush daemon would flush dirty (modified) buffers to disk. This still does not prevent the need for performing a fsck after a system crash but does help harden the system by minimizing data loss.

vfs_vget. This function is used by NFS to return a vnode given a specified file handle.

vfs_mountroot. This entry only exists for filesystems that can be mounted as the root filesystem. This may appear to be a strange operation. However, in the first version of SVR4, the s5 and UFS filesystems could be mounted as root filesystems and the root filesystem type could be specified during UNIX installation. Again, this gives a clear, well defined interface between the rest of the kernel and individual filesystems.

There are only a few minor differences between the vfsops provided in SVR4 and those introduced with the VFS/vnode interface in SunOS. The vfs structure with SVR4 contained all of the original Sun vfs fields and introduced a few others including vfs_dev, which allowed a quick and easy scan to see if a filesystem was already mounted, and the vfs_fstype field, which is used to index the vfssw[] array to specify the filesystem type.

Changes to the Vnode Structure and VOP Layer

The vnode structure had some subtle differences. The v_shlockc and v_exlockc fields were removed and replaced by additional vnode interfaces to handle locking. The other fields introduced in the original vnode structure remained and the following fields were added:

v_stream. If the file opened references a STREAMS device, the vnode field points to the STREAM head.

v_filocks. This field references any file and record locks that are held on the file.

v_pages. I/O changed substantially in SVR4 with all data being read and written through pages in the page cache as opposed to the buffer cache, which was now only used for meta-data (inodes, directories, etc.). All pages in-core that are part of a file are linked to the vnode and referenced through this field.

The vnodeops vector itself underwent more change. The vop_bmap(), vop_bread(), vop_brelse(), and vop_strategy() functions were removed as part of changes to the read and write paths. The vop_rdwr() and vop_select() functions were also removed. There were a number of new functions added as follows:

vop_read. The vop_rdwr function was split into separate read and write vnodeops. This function is called in response to a read() system call.

vop_write. The vop_rdwr function was split into separate read and write vnodeops. This function is called in response to a write() system call.

vop_setfl. This function is called in response to an fcntl() system call where the F_SETFL (set file status flags) flag is specified. This allows the filesystem to validate any flags passed.

vop_fid. This function was previously a VFS-level function in the Sun VFS/vnode architecture. It is used to generate a unique file handle from which NFS can later reference the file.

vop_rwlock. Locking was moved under the vnode interface, and filesystems implemented locking in a manner that was appropriate to their own internal implementation. Initially the file was locked for both read and write access. Later SVR4 implementations changed the interface to pass one of two flags, namely LOCK_SHARED or LOCK_EXCL. This allowed for a single writer but multiple readers.

vop_rwunlock. All vop_rwlock invocations should be followed by a subsequent vop_rwunlock call.

vop_seek. When specifying an offset to lseek(), this function is called to determine whether the filesystem deems the offset to be appropriate. With sparse files, seeking beyond the end of file and writing is a valid UNIX operation, but not all filesystems may support sparse files. This vnode operation allows the filesystem to reject such lseek() calls.

vop_cmp. This function compares two specified vnodes. This is used in the area of pathname resolution.

vop_frlock. This function is called to implement file and record locking.

vop_space. The fcntl() system call has an option, F_FREESP, which allows the caller to free space within a file. Most filesystems only implement freeing of space at the end of the file making this interface identical to truncate().

vop_realvp. Some filesystems, for example, specfs, present a vnode and hide the underlying vnode, in this case, the vnode representing the device. A call to VOP_REALVP() is made by filesystems when performing a link() system call to ensure that the link goes to the underlying file and not the specfs file, that has no physical representation on disk.

vop_getpage. This function is used to read pages of data from the file in response to a page fault.

vop_putpage. This function is used to flush a modified page of file data to disk.

vop_map. This function is used for implementing memory mapped files.

vop_addmap. This function adds a mapping.

vop_delmap. This function deletes a mapping.

vop_poll. This function is used for implementing the poll() system call.

vop_pathconf. This function is used to implement the pathconf() and fpathconf() system calls. Filesystem-specific information can be returned, such as the maximum number of links to a file and the maximum file size.

The vnode operations are accessed through the use of macros that reference the appropriate function by indirection through the vnode v_op field. For example, here is the definition of the VOP_LOOKUP() macro:

#define VOP_LOOKUP(vp,cp,vpp,pnp,f,rdir,cr) \
        (*(vp)->v_op->vop_lookup)(vp,cp,vpp,pnp,f,rdir,cr)

The filesystem-independent layer of the kernel will only access the filesystem through macros. Obtaining a vnode is performed as part of an open() or creat() system call or by the kernel invoking one of the veneer layer functions when kernel subsystems wish to access files directly. To demonstrate the mapping between file descriptors, memory mapped files, and vnodes, consider the following example:

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

#define MAPSZ 4096

main()
{
    char *addr, c;
    int fd1, fd2;

    fd1 = open("/etc/passwd", O_RDONLY);
    fd2 = dup(fd1);
    addr = (char *)mmap(NULL, MAPSZ, PROT_READ, MAP_SHARED, fd1, 0);
    close(fd1);
    c = *addr;
    pause();
}
A file is opened and then dup() is called to duplicate the file descriptor. The file is then mapped followed by a close of the first file descriptor. By accessing the address of the mapping, data can be read from the file. The following examples, using crash and adb on Solaris, show the main structures involved and scan for the data read, which should be attached to the vnode through the v_pages field. First of all, the program is run and crash is used to locate the process:
# ./vnode&
 # crash
 dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
 > p ! grep vnode
 35 s 4365 4343 4365 4343 0 46 vnode load
 > u 35
 PER PROCESS USER AREA FOR PROCESS 35
 PROCESS MISC:
 command: vnode, psargs: ./vnode
 start: Fri Aug 24 10:55:32 2001
 mem: b0, type: exec
 vnode of current directory: 30000881ab0
 OPEN FILES, FLAGS, AND THREAD REFCNT:
 [0]: F 30000adaa90, 0, 0 [1]: F 30000adaa90, 0, 0
 [2]: F 30000adaa90, 0, 0 [4]: F 30000adac50, 0, 0
 ...
The p (proc) command displays the process table. The output is piped to grep to locate the process. By running the u (user) command and passing the process slot as an argument, the file descriptors for this process are displayed. The first file descriptor allocated (3) was closed and the second (4) retained as shown above.

The entries shown reference file table slots. Using the file command, the entry for file descriptor number 4 is displayed followed by the vnode that it references:

> file 30000adac50
 ADDRESS RCNT TYPE/ADDR OFFSET FLAGS
 30000adac50 1 UFS /30000aafe30 0 read
 > vnode -l 30000aafe30
 VCNT VFSMNTED VFSP STREAMP VTYPE RDEV VDATA VFILOCKS
 VFLAG
 3 0 104440b0 0 f 30000aafda0 0 -
 mutex v_lock: owner 0 waiters 0
 Condition variable v_cv: 0
The file table entry points to a vnode that is then displayed using the vnode command. Unfortunately the v_pages field is not displayed by crash. Looking at the header file that corresponds to this release of Solaris, it is possible to see where in the structure the v_pages field resides. For example, consider the surrounding fields:
...
 struct vfs *v_vfsp; /* ptr to containing VFS */
 struct stdata *v_stream; /* associated stream */
 struct page *v_pages; /* vnode pages list */
 enum vtype v_type; /* vnode type */
 ...
The v_vfsp and v_type fields are displayed above so by dumping the area of memory starting at the vnode address, it is possible to display the value of v_pages. This is shown below:
> od -x 30000aafe30 8
 30000aafe30: 000000000000 cafe00000003 000000000000 0000104669e8
 30000aafe50: 0000104440b0 000000000000 0000106fbe80 0001baddcafe
There is no way to display page structures in crash, so the Solaris adb command is used as follows:
# adb -k
 physmem 3ac5
 106fbe80$<page
 106fbe80: vnode hash vpnext
 30000aafe30 1073cb00 106fbe80
 106fbe98: vpprev next prev
 106fbe80 106fbe80 106fbe80
 106fbeb0: offset selock lckcnt
 0 0 0
 106fbebe: cowcnt cv io_cv
 0 0 0
 106fbec4: iolock_state fsdata state
 0 0 0
Note that the offset field shows a value of 0 that corresponds to the offset within the file that the program issues the mmap() call for.

Pathname Traversal

The implementation of namei() started to become incredibly complex in some versions of UNIX as more and more functionality was added to a UNIX kernel implementation that was really inadequate to support it. [PATE96] shows how namei() was implemented in SCO OpenServer, a derivative of SVR3 for which namei() became overly complicated. With the addition of new vnodeops, pathname traversal in SVR4 became greatly simplified.

Because one of the goals of the original Sun VFS/vnode architecture was to support non-UNIX filesystems, it is not possible to pass a full pathname to the filesystem and ask it to resolve it to a vnode. Non-UNIX filesystems may not recognize the "/" character as a pathname component separator, DOS being a prime example. Thus, pathnames are resolved one component at a time.

The lookupname() function replaced the old namei() function found in earlier versions of UNIX. This takes a pathname structure and returns a vnode (if the pathname is valid). Internally, lookupname() allocates a pathname structure and calls lookuppn() to actually perform the necessary parsing and component lookup. The steps performed by lookuppn() are as follows:

if (absolute_pathname) {
 dirvp = rootdir
 } else {
 dirvp = u.u_cdir
 }
 do {
 name = extract string from pathname
 newvp = VOP_LOOKUP(dirvp, name, ...)
 if not last component {
 dirvp = newvp
 }
 } until basename of pathname reached
 return newvp
This is a fairly simple task to perform. Obviously, users can add all sorts of character combinations, as well as "." and ".." components, in the specified pathname, so there is a lot of string manipulation to perform, which complicates the work of lookuppn().

The Directory Name Lookup Cache

The section The Inode Cache in Chapter 6 described how the inode cache provided a means by which to store inodes that were no longer being used. This helped speed up access during pathname traversal if an inode corresponding to a component in the pathname was still present in the cache.

Introduced initially in 4.2BSD and then in SVR4, the directory name lookup cache (DNLC) provides an easy and fast way to get from a pathname to a vnode. For example, in the old inode cache method, parsing the pathname /usr/lib/fs/vxfs/bin/mkfs would involve working on each component of the pathname one at a time. The inode cache merely saved going to disk during processing of iget(), which is not to say that this isn't a significant performance enhancement. However, it still involved a directory scan to locate the appropriate inode number. With the DNLC, a search may be made by the name component alone. If the entry is cached, the vnode is returned. At hit rates over 90 percent, this results in a significant performance enhancement.

The DNLC is a cache of ncache structures linked on an LRU (Least Recently Used) list. The main elements of the structure are shown below and the linkage between elements of the DNLC is shown in Figure 7.5.

name. The pathname stored.

namelen. The length of the pathname.

vp. This field points to the corresponding vnode.

dvp. This field points to the vnode of the parent directory in which the file resides.

The ncache structures are hashed to improve lookups. This alleviates the need for unnecessary string comparisons. To access an entry in the DNLC, a hash value is calculated from the filename and parent vnode pointer. The appropriate entry in the nc_hash[] array is accessed, through which the cache can be searched. There are a number of DNLC-provided functions that are called by both the filesystem and the kernel.

dnlc_enter. This function is called by the filesystem to add an entry to the DNLC. This is typically called during pathname resolution on a successful VOP_LOOKUP() call. It is also called when a new file is created or after other operations which involve introducing a new file to the namespace such as creation of hard and symbolic links, renaming of files, and creation of directories.

dnlc_lookup. This function is typically called by the filesystem during pathname resolution. Because pathnames are resolved one entry at a time, the parent directory vnode is passed in addition to the file name to search for. If the entry exists, the corresponding vnode is returned, otherwise NULL is returned.

dnlc_remove. Renaming of files and removal of files are functions for which the entry in the DNLC must be removed.

dnlc_purge_vp. This function can be called to remove all entries in the cache that reference the specified vnode.

dnlc_purge_vfsp. When a filesystem is to be unmounted, this function is called to remove all entries that have vnodes associated with the filesystem that is being unmounted.

dnlc_purge1. This function removes a single entry from the DNLC. SVR4 does not provide a centralized inode cache as found in earlier versions of UNIX. Any caching of inodes or other filesystem-specific data is the responsibility of the filesystem. This function was originally implemented to handle the case where an inode that was no longer in use has been removed from the inode cache.
As mentioned previously, there should be a hit rate of greater than 90 percent in the DNLC; otherwise it should be tuned appropriately. The size of the DNLC is determined by the tunable ncsize and is typically based on the maximum number of processes and the maximum number of users.
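The way a filesystem typically uses these functions inside its lookup path can be sketched as follows; "myfs", its directory-scan routine, and the simplified argument lists are assumptions for illustration only.

 int
 myfs_lookup(struct vnode *dvp, char *name, struct vnode **vpp)
 {
     struct vnode *vp;

     vp = dnlc_lookup(dvp, name);       /* fast path: cached name-to-vnode entry */
     if (vp != NULL) {
         *vpp = vp;
         return 0;
     }

     vp = myfs_dirscan(dvp, name);      /* slow path: scan the directory on disk */
     if (vp == NULL)
         return ENOENT;

     dnlc_enter(dvp, name, vp);         /* cache the result for subsequent lookups */
     *vpp = vp;
     return 0;
 }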

Filesystem and Virtual Memory Interactions

With the inclusion of the SunOS VM subsystem in SVR4, and the integration between the filesystem and the Virtual Memory (VM) subsystem, the SVR4 VFS architecture radically changed the way that I/O took place. The buffer cache changed in usage and a tight coupling between VM and filesystems together with page-based I/O involved changes throughout the whole kernel from filesystems to the VM to individual disk drivers.

Consider the old style of file I/O that took place in UNIX up to and including SVR3. The filesystem made calls into the buffer cache to read and write file data. For demand paging, the File System Switch architecture provided filesystem interfaces to aid demand paging of executable files, although all file data was still read and written through the buffer cache.

This was still largely intact when the Sun VFS/vnode architecture was introduced. However, in addition to their VFS/vnode implementation, Sun Microsystems introduced a radically new Virtual Memory subsystem that was, in large part, to become the new SVR4 VM.

The following sections describe the main components and features of the SVR4 VM together with how file I/O takes place. For a description of the SunOS implementation, consult the Usenix paper "Virtual Memory Architecture in SunOS" [GING87].

An Overview of the SVR4 VM Subsystem

The memory image of each user process is defined by an as (address space) structure that references a number of segments underpinned by the seg structure. Consider a typical user process. The address space of the process will include separate segments for text, data, and stack, in addition to various libraries, shared memory, and memory-mapped files as shown pictorially in Figure 7.6.

The seg structure defines the boundaries covering each segment. This includes the base address in memory together with the size of the segment.

There are a number of different segment types. Each segment type has an array of segment-related functions in the same way that each vnode has an array of vnode functions. In the case of a page fault, the kernel will call the fault() function for the specified segment causing the segment handler to respond by reading in the appropriate data from disk. When a process is forked, the dup() function is called for each segment and so on.

For those segments such as process text and data that are backed by a file, the segvn segment type is used. Each segvn segment has associated private, per-segment data that is accessed through the s_data field of the seg structure. This particular structure, segvn_data, contains information about the segment as well as the underlying file. For example, segvn segment operations need to know whether the segment is read-only, read/write, or whether it has execute access so that it can respond accordingly to a page fault. As well as referencing the vnode backing the segment, the offset at which the segment is mapped to the file must be known. As a hypothetical example, consider the case where user text is held at an offset of 0x4000 from the start of the executable file. If a page fault occurs within the text segment at the address s_base + 0x2000, the segment page fault handler knows that the data must be read from the file at an offset of 0x4000 + 0x2000 = 0x6000.
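The arithmetic in that example can be expressed directly. A minimal sketch follows, using the field names from the text rather than the exact segvn_data layout, and assuming <sys/types.h> for off_t:

 /* File offset for a fault at fault_addr in a file-backed segment:
  * the segment's file offset plus the fault's distance from s_base,
  * e.g. 0x4000 + 0x2000 = 0x6000 in the example above. */
 off_t
 segvn_fault_offset(char *fault_addr, char *s_base, off_t offset)
 {
     return (offset + (off_t)(fault_addr - s_base));
 }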

After a user process starts executing, there will typically be no physical pages of data backing these segments. Thus, the first instruction that the process executes will generate a page fault within the segment covering that instruction. The kernel page fault handler must first determine in which segment the fault occurred. This is achieved using the list of segments referenced by the process's as structure, together with the base address and size of each segment. If the address that generated the page fault does not fall within the boundaries of any of the process segments, the process is posted a SIGSEGV, which will typically result in the process dumping core.
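A self-contained sketch of that dispatch logic is shown below; the structure layouts and names are illustrative only and are far simpler than the real as and seg structures.

 #include <stddef.h>

 struct seg {
     char        *s_base;                          /* segment base address */
     size_t       s_size;                          /* segment size */
     struct seg  *s_next;                          /* next segment in the as */
     int        (*s_fault)(struct seg *, char *);  /* segment fault handler */
 };

 struct as {
     struct seg  *a_segs;                          /* list of segments */
 };

 /* Return the handler's result, or -1 if no segment covers addr; in the
  * real kernel the latter case results in SIGSEGV being posted. */
 int
 as_fault_sketch(struct as *as, char *addr)
 {
     struct seg *seg;

     for (seg = as->a_segs; seg != NULL; seg = seg->s_next) {
         if (addr >= seg->s_base && addr < seg->s_base + seg->s_size)
             return (seg->s_fault(seg, addr));
     }
     return (-1);
 }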

To show how these structures are used in practice, consider the following invocation of the sleep(1) program:

$ /usr/bin/sleep 100000&

Using crash, the process can be located and the list of segments can be displayed as follows:

# crash
 dumpfile = /dev/mem, namelist = /dev/ksyms, outfile = stdout
 > p ! grep sleep
 32 s 7719 7694 7719 7694 0 46 sleep load
 > as -f 32
 PROC PAGLCK CLGAP VBITS HAT HRM RSS
 SEGLST LOCK SEGS SIZE LREP TAIL NSEGS
 32 0 0 0x0 0x4f958 0x0
 0xb10070 0x7fffefa0 0xb5aa50 950272 0 0xb3ccc0 14
 BASE SIZE OPS DATA
 0x 10000 8192 segvn_ops 0x30000aa46b0
 0x 20000 8192 segvn_ops 0x30000bfa448
 0x 22000 8192 segvn_ops 0x30000b670f8
 0xff280000 679936 segvn_ops 0x30000aa4e40
 0xff336000 24576 segvn_ops 0x30000b67c50
 0xff33c000 8192 segvn_ops 0x30000bfb260
 0xff360000 16384 segvn_ops 0x30000bfac88
 0xff372000 16384 segvn_ops 0x30000bface0
 0xff380000 16384 segvn_ops 0x30001af3f48
 0xff3a0000 8192 segvn_ops 0x30000b677d8
 0xff3b0000 8192 segvn_ops 0x30000b239d8
 0xff3c0000 131072 segvn_ops 0x30000b4c5e0
 0xff3e0000 8192 segvn_ops 0x30000b668b8
 0xffbee000 8192 segvn_ops 0x30000bfad38
 
There are 14 segments used to construct the address space, all of which are segvn segments. Looking at the segment with base address 0xff280000, whose private data pointer is 0x30000aa4e40, the segvn private data structure associated with this segment can be displayed within adb as follows:
0x30000aa4e40$<segvn
 30000aa4e40: lock
 30000aa4e40: wwwh
 0
 30000aa4e48: pageprot prot maxprot
 0 015 017
 30000aa4e4b: type offset vp
 02 0 30000749c58
 30000aa4e60: anon_index amp vpage
 0 0 0
 30000aa4e78: cred swresv advice
 30000429b68 0 0
The vnode representing the file that backs this segment is displayed, together with the offset within the file. The vnode and inode macros can be used to display both the vnode and the underlying UFS inode:
30000749c58$<vnode
 30000749c60: flag refcnt vfsmnt
 1000 63 0
 30000749c70: op vfsp stream
 ufs_vnodeops 104440b0 0
 30000749c88: pages type rdev
 107495e0 1 0
 30000749ca0: data filocks shrlocks
 30000749bc8 0 0
 ...
 30000749bc8$<inode
 ...
 30000749ce0: number diroff ufsvfs
 50909 0 3000016ee18
 ...
Finally, listing the library below shows that its inode number matches the inode displayed above:
# ls -i /usr/lib/libc.so.1
 50909 /usr/lib/libc.so.1
An interesting exercise to try is to run some of the programs presented in the book, particularly those that use memory-mapped files, map the segments displayed back to the specific file on disk, and note the file offsets and size of the segments in question.

The segvn segment type is of most interest to filesystem writers. Other segments include seg_u for managing user areas, seg_kmem for use by the kernel virtual memory allocator, and seg_dev, which is used to enable applications to memory-map devices.

The kernel address space is managed in a similar manner to the user address space in that it has its own address space structure referenced by the kernel variable k_as. This points to a number of different segments, one of which represents the SVR4 page cache that is described later in this chapter.

Anonymous Memory

When a process starts executing, pages of the data section may be modified and, once read from the file, therefore lose their association with the file. All such segvn segments contain a reference to the original file from which the data must initially be read, but also a reference to a set of anonymous pages.

Every anonymous page has reserved space on the swap device. If memory runs low and anonymous pages need to be paged out, they can be written to the swap device and read back into memory later. Anonymous pages are described by the anon structure, which contains a reference count as well as a pointer to the actual page. It also points to an entry within an si_anon[] array, of which there is one per swap device. The location within this array determines where on the swap device the page will be written if it must be paged out. This is shown pictorially in Figure 7.7.
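A minimal sketch of such an entry, following the description above, is shown below; the real structure and the swap-slot linkage differ in detail.

 struct page;

 struct anon {
     int           an_refcnt;    /* processes/segments sharing this page */
     struct page  *an_page;      /* the page itself, if resident in memory */
     struct anon **an_slot;      /* back-pointer into a swap device's
                                  * si_anon[] array; the slot's index fixes
                                  * where the page goes if it is paged out */
 };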

File I/O through the SVR4 VFS Layer

SVR4 implemented what is commonly called the page cache, through which all file data is read and written. The term is somewhat vague because the page cache differs substantially from fixed-size caches such as the buffer cache and the DNLC.

The page cache is composed of two parts, a segment underpinned by the seg_map segment driver and a list of free pages that can be used for any purpose. Thus, after a page of file data leaves the cache, it is added to the list of free pages. While the page is on the free list, it still retains its identity so that if the kernel wishes to locate the same data prior to the page being reused, the page is removed from the free list and the data does not need to be re-read from disk. The main structures used in constructing the page cache are shown in Figure 7.8.

The segmap structure is part of the kernel address space and is underpinned by the segmap_data structure that describes the properties of the segment. The size of the segment is tunable and is split into MAXBSIZE (8KB) chunks where each 8KB chunk represents an 8KB window into a file. Each chunk is referenced by an smap structure that contains a pointer to a vnode for the file and the offset within the file. Thus, whereas the buffer cache references file data by device and block number, the page cache references file data by vnode pointer and file offset.
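A simplified view of one slot, following the description above, is shown below; real implementations add hash linkage, reference counts, and locking.

 #define MAXBSIZE 8192                   /* size of each file window */

 struct vnode;

 struct smap {
     struct vnode      *sm_vp;           /* file backing this window, NULL if unused */
     unsigned long long sm_off;          /* MAXBSIZE-aligned offset within the file */
 };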
Two VM functions provide the basis for performing I/O in the new SVR4 model. The first, shown below, is used in a similar manner to getblk(), returning either a new entry in the page cache or a previously cached entry:

 addr_t
 segmap_getmap(struct seg *seg, vnode_t *vp, uint_t offset);

The seg argument is always segkmap. The remaining two arguments are the vnode and the offset within the vnode where the data is to be read from or written to. The offset must be a multiple of 8KB from the start of the file.

The address returned from segmap_getmap() is a kernel virtual address within the segmap segment range s_base to s_base + s_size. When the page cache is first initialized, the first call to segmap_getmap() will result in the first smap structure being used. The sm_vp and sm_off fields are updated to hold the vnode and offset passed in, and the virtual address corresponding to this entry is returned. After all slots in the segmap window have been used, the segmap driver must reuse one of the existing slots. This works in a similar manner to the buffer cache where older buffers are reused when no free buffers are available. After a slot is reallocated, the pages backing that slot are placed on the free list. Thus, the page cache essentially works at two levels with the page free list also acting as a cache.

The segmap_release() function, shown below, works in a similar way to brelse() by allowing the entry to be reused:

int segmap_release(struct seg *seg, addr_t addr, u_int flags)

This is where the major difference between SVR4 and other UNIX kernels comes into play. The virtual address returned by segmap_getmap() will not have any associated physical pages on the first call with a specific vnode and offset. Consider the following code fragment, which a filesystem would use to read 1024 bytes from an offset of 8KB within a file:

kaddr = segmap_getmap(segkmap, vp, 8192);
 uiomove(kaddr, 1024, UIO_READ, uiop);
 segmap_release(segkmap, kaddr, SM_FREE);
The uiomove() function is called to copy bytes from one address to another. Because there are no physical pages backing kaddr, a page fault will occur. Because the kernel address space, referenced by kas, contains a linked list of segments each with a defined start and end address, it is easy for the page fault handling code to determine which segment fault handler to call to satisfy the page fault. In this case the s_fault() function provided with the segmap driver will be called as follows:

segkmap->s_ops->fault(seg, addr, ssize, type, rw);

By using the s_base and addr arguments passed to the fault handler, the appropriate vnode can be located from the corresponding smap structure. A call is then made to the filesystem's VOP_GETPAGE() function, which must allocate the appropriate pages and read the data from disk before returning. Once this is complete, the page fault is satisfied and the uiomove() function continues.

A pictorial view of the steps taken when reading a file through the VxFS filesystem is shown in Figure 7.9 (Reading from a file via the SVR4 page cache): read() obtains the vnode from the file descriptor via getf() and calls VOP_READ(); the resulting page fault enters as_fault(), which locates the segment and calls its s_fault() routine; segmap_fault() retrieves the vnode from the smap entry and calls VOP_GETPAGE(); the filesystem's vx_getpage() routine then allocates pages and reads the data from disk.

To write to a file, the same procedure is followed up to the point where segmap_release() is called. The flags argument determines what happens to the pages once the segment is released. The values that flags can take are:

SM_WRITE. The pages should be written, via VOP_PUTPAGE(), to the file once the segment is released.

SM_ASYNC. The pages should be written asynchronously.

SM_FREE. The pages should be freed.

SM_INVAL. The pages should be invalidated.

SM_DONTNEED. The filesystem has no need to access these pages again.

If no flags are specified, the call to VOP_PUTPAGE() will not occur. This is the default behavior when reading from a file.
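For comparison with the read fragment shown earlier, a hedged sketch of writing 1024 bytes at an 8KB file offset might look as follows. Declarations and error handling are trimmed, and as before the exact prototypes are simplified.

 /* Map the 8KB window, copy the user's data into it, and ask for the
  * pages to be written back via VOP_PUTPAGE() when the slot is released.
  * A partial write into a block the file already contains would first
  * require the filesystem to read the existing data into the window. */
 kaddr = segmap_getmap(segkmap, vp, 8192);
 error = uiomove(kaddr, 1024, UIO_WRITE, uiop);
 segmap_release(segkmap, kaddr, error ? SM_INVAL : SM_WRITE);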

Memory-Mapped File Support in SVR4

A call to mmap() results in a new segvn segment being attached to the calling process's address space. A call is made to the filesystem's VOP_MAP() function, which performs some level of validation before calling the map_addr() function to initialize the process address space with the new segment.

Page faults on the mapping result in a very similar set of steps to page faults on the segmap segment. The segvn fault handler is called with the process address space structure and the faulting virtual address. Attached to the private data of this segment are the vnode, the file offset that was passed to mmap(), and a set of permissions indicating the type of mapping.

In the simple case of a memory read access, the segvn driver will call VOP_GETPAGE() to read in the requested page from the file. Again, the filesystem will allocate the page and read in the contents from disk.

In the following program, /etc/passwd is mapped. The text that follows shows how to display the segments for this process, locate the segvn segment for the mapped region, and follow it back to the passwd file so that data can be read and written as appropriate. The program is very straightforward, mapping the first MAPSZ bytes of the file from a file offset of 0.

 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/mman.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <stdio.h>

 #define MAPSZ   4096

 int
 main(void)
 {
     char    *addr, c;
     int     fd;

     fd = open("/etc/passwd", O_RDONLY);
     addr = (char *)mmap(NULL, MAPSZ,
                         PROT_READ, MAP_SHARED, fd, 0);
     printf("addr = 0x%lx\n", (unsigned long)addr);
     c = *addr;      /* touch the mapping to force a page fault */
     pause();
     return (0);
 }
After running the program, it can be located with crash as follows. Using the process table slot, the as (address space) for the process is then displayed.
 # mydup&
 addr = 0xff390000
 # crash
 > p ! grep mydup
 38 s 4836 4800 4836 4800 0 46 map load
 > p -f 38
 PROC TABLE SIZE = 1882
 SLOT ST PID PPID PGID SID UID PRI NAME FLAGS
 38 s 4836 4800 4836 4800 0 46 map load
 Session: sid: 4800, ctty: vnode(30001031448) maj(24) min(1)
 Process Credentials: uid: 0, gid: 1, real uid: 0, real gid: 1
 as: 300005d8ff8
 ...
From within adb, the address space can be displayed by invoking the as macro. This shows a pointer to the list of segments corresponding to this process; in this case there are 12 segments. The seglist macro then displays each segment in the list. Here, only the segment corresponding to the mapped file is shown; it is located by looking for the segment whose base address matches the address returned from mmap(), displayed above.
300005d8ff8$<as
 ...
 300005d9040: segs size tail
 30000b5a2a8 e0000 30000b5a190
 300005d9058: nsegs lrep hilevel
 12 0 0
 ...
 30000b5a2a8$<seglist
 ...
 30000b11f80: base size as
 ff390000 2000 300005d8ff8
 30000b11f98: next prev ops
 30000b5a4a0 30000b5b8c0 segvn_ops
 30000b11fb0: data
 30000b4d138
 ...
Note that in addition to the base address, the size of the segment, 0x2000 (8KB), corresponds to the mapping requested: MAPSZ bytes rounded up to the 8KB page size used on this platform. The data field points to private segment-specific data. This can be displayed using the segvn macro as follows:
30000b4d138$<segvn
 ...
 30000b4d143: type offset vp
 01 0 30000aafe30
 ...
Of most interest here, the vp field points to the vnode from which this segment is backed. The offset field gives the offset within the file which, as specified to mmap(), is 0. The remaining two macro calls display the vnode referenced previously and the UFS inode corresponding to the vnode.
30000aafe30$<vnode
 30000aafe38: flag refcnt vfsmnt
 0 3 0
 30000aafe48: op vfsp stream
 ufs_vnodeops 104440b0 0
 30000aafe60: pages type rdev
 106fbe80 1 0
 30000aafe78: data filocks shrlocks
 30000aafda0 0 0
 30000aafda0$<inode
 ...
 30000aafeb8: number diroff ufsvfs
 129222 0 3000016ee18
 ...
As a check, the inode number of /etc/passwd is listed below; it matches the number displayed above:
# ls -i /etc/passwd
 129222 /etc/passwd

Flushing Dirty Pages to Disk

There are a number of cases where modified pages need to be written to disk. This may result from the pager finding pages to steal, an explicit call to msync(), or when a process exits and modified pages within a mapping need to be written back to disk. The VOP_PUTPAGE() vnode operation is called to write a single page back to disk.

The single-page approach may not be ideal for filesystems such as VxFS that can have multipage extents. The same also holds true for any filesystem where the block size is greater than the page size. Rather than flush a single dirty page to disk, it is preferable to flush a range of pages: for VxFS this may cover all of the extent's dirty pages that are in memory. The VM subsystem provides a number of routines for manipulating lists of pages. For example, the function pvn_getdirty_range() can be called to gather all dirty pages in a specified range. The pages in the range are gathered into a linked list and passed to a filesystem-specified routine, which can then write the page list to disk.
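A heavily hedged sketch of how a filesystem's putpage path might use such a routine is shown below. The pvn_getdirty_range() argument list shown here is an assumption made for illustration only, and vx_writelist() is a hypothetical routine that writes a linked list of pages to disk.

 /* Gather every dirty page in [off, off + len) and hand the resulting
  * page list to the filesystem's write routine in a single call. */
 static int
 vx_putpage_range(struct vnode *vp, long long off, size_t len, int flags)
 {
     return (pvn_getdirty_range(vp, off, off + len, vx_writelist, flags));
 }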

Page-Based I/O

Prior to SVR4, all I/O went through the buffer cache. Each buffer pointed to a kernel virtual address where the data could be transferred to and from. With the change to a page-based model for file I/O in SVR4, the filesystem deals with pages for file data I/O and may wish to perform I/O to more than one page at a time. For example, as described in the previous section, a call back into the filesystem from pvn_getdirty_range() passes a linked list of page structures. However, these pages do not typically have associated kernel virtual addresses. To avoid unnecessary use of kernel virtual address space and the added cost of mapping these pages, the buffer cache subsystem as well as the underlying device drivers were modified to accept a list of pages. In this case, the b_pages field of the buf structure is set to point to the linked list of pages and the B_PAGES flag must be set in b_flags.

At the stage that the filesystem wishes to perform I/O, it will typically have a linked list of pages into which data needs to be read or from which data needs to be written. To prevent duplication across filesystems, the kernel provides a function, pageio_setup(), which allocates a buf structure, attaches the list of pages to b_pages, and initializes b_flags to include B_PAGES. This flag is used by the driver to determine that page I/O is being performed and that b_pages should be used rather than b_addr. Note that this buffer is not part of the buffer cache.

The I/O is actually performed by calling the driver strategy function. If the filesystem needs to wait for the I/O completion, it must call biowait(), passing the buf structure as an argument. After the I/O is complete, a call to pageio_done() will free the buffer, leaving the page list intact.
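Putting those steps together, a hedged sketch of synchronous page I/O might look like the following. The pageio_setup() prototype shown is an assumption based on the description above, bdev_strategy() stands in for invoking the underlying driver's strategy routine, and the b_lblkno conversion is illustrative only.

 /* Issue I/O for a list of pages: build a buf (not part of the buffer
  * cache) with b_pages pointing at the list and B_PAGES set, hand it to
  * the driver, wait for completion, then release the buf but keep the
  * page list intact. */
 static int
 fs_pageio(struct vnode *vp, struct page *plist, long long off,
           size_t len, int flags)
 {
     struct buf *bp;
     int error;

     bp = pageio_setup(plist, len, vp, flags);   /* sets b_pages, B_PAGES */
     bp->b_lblkno = lbtodb(off);                 /* assumed: byte offset to block */

     bdev_strategy(bp);                          /* start the transfer */
     error = biowait(bp);                        /* wait for I/O completion */

     pageio_done(bp);                            /* free bp, keep the pages */
     return (error);
 }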

Adoption of the SVR4 Vnode Interface

Although many OS vendors implemented the VFS/vnode architecture within the framework of their own UNIX implementations, the SVR4 style of page I/O, while elegant and efficient in its use of the underlying memory, failed to gain widespread adoption. In part this was due to the closed manner in which SVR4 was developed; the implementation was not initially documented. A further reason was the amount of change required both to the VM subsystem and to every supported filesystem.

Summary

The period between the development of SVR3 and SunOS and the transition to SVR4 saw a substantial investment both in the filesystem framework within the kernel and in the development of individual filesystems. The VFS/vnode architecture has proved immensely popular and has been ported in one way or another to most versions of UNIX. For further details on SVR4.0, Goodheart and Cox's book The Magic Garden Explained: The Internals of UNIX System V Release 4, An Open Systems Design [GOOD94] provides a detailed account of SVR4 kernel internals. For details on the File System Switch (FSS) architecture, Pate's book UNIX Internals: A Practical Approach [PATE96] is one of the few references.