Serg Iakovlev

We begin with a description of the functions used for file access. Almost all file operations in UNIX can be reduced to five functions: open, read, write, lseek, and close. We also examine the effect of the buffer size used with the read and write functions.

The functions described in this chapter are unbuffered: each call invokes a system call in the kernel. They are not part of ISO C; they belong to POSIX.1 and the Single UNIX Specification.

When resources are shared among processes, the concept of an atomic operation becomes important. We examine it using the arguments to the open function as an example. We also cover the dup, fcntl, sync, fsync, and ioctl functions.

File Descriptors

To the kernel, all open files are referred to by file descriptors. A file descriptor is a non-negative integer. When a process opens an existing file or creates a new one, the kernel returns a file descriptor to the process. By convention, file descriptor 0 is the standard input of a process, descriptor 1 is standard output, and descriptor 2 is standard error. This is not a feature of the kernel; it is simply a convention.

The values 0, 1, and 2 correspond to the constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO defined in the <unistd.h> header.

File descriptors range from 0 through OPEN_MAX - 1. Early implementations had an upper limit of 19; later systems raised it to 63.

`open` Function

A file is opened or created by calling the open function.

     #include <fcntl.h>

     int open(const char *pathname, int oflag, ... /* mode_t mode */ );

Returns: file descriptor if OK, -1 on error

We show the third argument as ..., which is the ISO C way to specify that the number and types of the remaining arguments may vary. For this function, the third argument is used only when a new file is being created, as we describe later. We show this argument as a comment in the prototype.

The pathname is the name of the file to open or create. This function has a multitude of options, which are specified by the oflag argument. This argument is formed by ORing together one or more of the following constants from the <fcntl.h> header:

O_RDONLY
Open for reading only.
O_WRONLY
Open for writing only.
O_RDWR
Open for reading and writing.

Most implementations define O_RDONLY as 0, O_WRONLY as 1, and O_RDWR as 2, for compatibility with older programs.

One and only one of these three constants must be specified. The following constants are optional:

O_APPEND
Append to the end of file on each write. We describe this option in detail in Section 3.11.
O_CREAT
Create the file if it doesn't exist. This option requires a third argument to the open function, the mode, which specifies the access permission bits of the new file. (When we describe a file's access permission bits in Section 4.5, we'll see how to specify the mode and how it can be modified by the umask value of a process.)
O_EXCL
Generate an error if O_CREAT is also specified and the file already exists. This test for whether the file already exists and the creation of the file if it doesn't exist is an atomic operation. We describe atomic operations in more detail in Section 3.11.
O_TRUNC
If the file exists and if it is successfully opened for either write-only or read-write, truncate its length to 0.
O_NOCTTY
If the pathname refers to a terminal device, do not allocate the device as the controlling terminal for this process. We talk about controlling terminals in Section 9.6.
O_NONBLOCK
If the pathname refers to a FIFO, a block special file, or a character special file, this option sets the nonblocking mode for both the opening of the file and subsequent I/O. We describe this mode in Section 14.2.

In earlier releases of System V, the O_NDELAY (no delay) flag was introduced. This option is similar to the O_NONBLOCK (nonblocking) option, but an ambiguity was introduced in the return value from a read operation. The no-delay option causes a read to return 0 if there is no data to be read from a pipe, FIFO, or device, but this conflicts with a return value of 0, indicating an end of file. SVR4-based systems still support the no-delay option, with the old semantics, but new applications should use the nonblocking option instead.

The following three flags are also optional. They are part of the synchronized input and output option of the Single UNIX Specification (and thus POSIX.1):

O_DSYNC
Have each write wait for physical I/O to complete, but don't wait for file attributes to be updated if they don't affect the ability to read the data just written.
O_RSYNC
Have each read operation on the file descriptor wait until any pending writes for the same portion of the file are complete.
O_SYNC
Have each write wait for physical I/O to complete, including I/O necessary to update file attributes modified as a result of the write. We use this option in Section 3.14.

The O_DSYNC and O_SYNC flags are similar, but subtly different. The O_DSYNC flag affects a file's attributes only when they need to be updated to reflect a change in the file's data (for example, update the file's size to reflect more data). With the O_SYNC flag, data and attributes are always updated synchronously. When overwriting an existing part of a file opened with the O_DSYNC flag, the file times wouldn't be updated synchronously. In contrast, if we had opened the file with the O_SYNC flag, every write to the file would update the file's times before the write returns, regardless of whether we were writing over existing bytes or appending to the file.

Solaris 9 supports all three flags. FreeBSD 5.2.1 and Mac OS X 10.3 have a separate flag (O_FSYNC) that does the same thing as O_SYNC. Because the two flags are equivalent, FreeBSD 5.2.1 defines them to have the same value (but curiously, Mac OS X 10.3 doesn't define O_SYNC). FreeBSD 5.2.1 and Mac OS X 10.3 don't support the O_DSYNC or O_RSYNC flags. Linux 2.4.22 treats both flags the same as O_SYNC.

The file descriptor returned by open is guaranteed to be the lowest-numbered unused descriptor. This fact is used by some applications to open a new file on standard input, standard output, or standard error. For example, an application might close standard output (normally file descriptor 1) and then open another file, knowing that it will be opened on file descriptor 1. We'll see a better way to guarantee that a file is open on a given descriptor in Section 3.12 with the dup2 function.

Filename and Pathname Truncation

What happens if NAME_MAX is 14 and we try to create a new file in the current directory with a filename containing 15 characters? Traditionally, early releases of System V, such as SVR2, allowed this to happen, silently truncating the filename beyond the 14th character. BSD-derived systems returned an error status, with errno set to ENAMETOOLONG. Silently truncating the filename presents a problem that affects more than simply the creation of new files. If NAME_MAX is 14 and a file exists whose name is exactly 14 characters, any function that accepts a pathname argument, such as open or stat, has no way to determine what the original name of the file was, as the original name might have been truncated.

With POSIX.1, the constant _POSIX_NO_TRUNC determines whether long filenames and long pathnames are truncated or whether an error is returned. As we saw in Chapter 2, this value can vary based on the type of the file system.

Whether or not an error is returned is largely historical. For example, SVR4-based systems do not generate an error for the traditional System V file system, S5. For the BSD-style file system (known as UFS), however, SVR4-based systems do generate an error.

As another example, see Figure 2.19. Solaris will return an error for UFS, but not for PCFS, the DOS-compatible file system, as DOS silently truncates filenames that don't fit in an 8.3 format.

BSD-derived systems and Linux always return an error.

If _POSIX_NO_TRUNC is in effect, errno is set to ENAMETOOLONG, and an error status is returned if the entire pathname exceeds PATH_MAX or any filename component of the pathname exceeds NAME_MAX.

3.4. `creat` Function

A new file can also be created by calling the creat function.

     #include <fcntl.h>

     int creat(const char *pathname, mode_t mode);

Returns: file descriptor opened for write-only if OK, -1 on error

Note that this function is equivalent to

     open (pathname, O_WRONLY | O_CREAT | O_TRUNC, mode);

Historically, in early versions of the UNIX System, the second argument to open could be only 0, 1, or 2. There was no way to open a file that didn't already exist. Therefore, a separate system call, creat, was needed to create new files. With the O_CREAT and O_TRUNC options now provided by open, a separate creat function is no longer needed.

We'll show how to specify mode in Section 4.5 when we describe a file's access permissions in detail.

One deficiency with creat is that the file is opened only for writing. Before the new version of open was provided, if we were creating a temporary file that we wanted to write and then read back, we had to call creat, close, and then open. A better way is to use the open function, as in

     open (pathname, O_RDWR | O_CREAT | O_TRUNC, mode);

3.5. `close` Function

An open file is closed by calling the close function.

     #include <unistd.h>

     int close(int filedes);

Returns: 0 if OK, -1 on error

Closing a file also releases any record locks that the process may have on the file. We'll discuss this in Section 14.3.

When a process terminates, all of its open files are closed automatically by the kernel. Many programs take advantage of this fact and don't explicitly close open files. See the program in Figure 1.4, for example.

3.6. `lseek` Function

Every open file has an associated "current file offset," normally a non-negative integer that measures the number of bytes from the beginning of the file. (We describe some exceptions to the "non-negative" qualifier later in this section.) Read and write operations normally start at the current file offset and cause the offset to be incremented by the number of bytes read or written. By default, this offset is initialized to 0 when a file is opened, unless the O_APPEND option is specified.

An open file's offset can be set explicitly by calling lseek.

     #include <unistd.h>

     off_t lseek(int filedes, off_t offset, int whence);

Returns: new file offset if OK, -1 on error

The interpretation of the offset depends on the value of the whence argument.

If whence is SEEK_SET, the file's offset is set to offset bytes from the beginning of the file.
If whence is SEEK_CUR, the file's offset is set to its current value plus the offset. The offset can be positive or negative.
If whence is SEEK_END, the file's offset is set to the size of the file plus the offset. The offset can be positive or negative.

Because a successful call to lseek returns the new file offset, we can seek zero bytes from the current position to determine the current offset:

     off_t    currpos;
 
     currpos = lseek(fd, 0, SEEK_CUR);

This technique can also be used to determine if a file is capable of seeking. If the file descriptor refers to a pipe, FIFO, or socket, lseek sets errno to ESPIPE and returns -1.

The three symbolic constants (SEEK_SET, SEEK_CUR, and SEEK_END) were introduced with System V. Prior to this, whence was specified as 0 (absolute), 1 (relative to current offset), or 2 (relative to end of file). Much software still exists with these numbers hard coded.

The character l in the name lseek means "long integer." Before the introduction of the off_t data type, the offset argument and the return value were long integers. lseek was introduced with Version 7 when long integers were added to C. (Similar functionality was provided in Version 6 by the functions seek and tell.)

Example

The program in Figure 3.1 tests its standard input to see whether it is capable of seeking.

If we invoke this program interactively, we get

    $ ./a.out < /etc/motd
    seek OK
    $ cat < /etc/motd | ./a.out
    cannot seek
    $ ./a.out < /var/spool/cron/FIFO
    cannot seek

Figure 3.1. Test whether standard input is capable of seeking

 #include "apue.h"
 
 int
 main(void)
 {
     if (lseek(STDIN_FILENO, 0, SEEK_CUR) == -1)
        printf("cannot seek\n");
     else
        printf("seek OK\n");
     exit(0);
 }

Normally, a file's current offset must be a non-negative integer. It is possible, however, that certain devices could allow negative offsets. But for regular files, the offset must be non-negative. Because negative offsets are possible, we should be careful to compare the return value from lseek as being equal to or not equal to -1 and not test if it's less than 0.

The /dev/kmem device on FreeBSD for the Intel x86 processor supports negative offsets.

Because the offset (off_t) is a signed data type (Figure 2.20), we lose a factor of 2 in the maximum file size. If off_t is a 32-bit integer, the maximum file size is 2³¹-1 bytes.

lseek only records the current file offset within the kernel; it does not cause any I/O to take place. This offset is then used by the next read or write operation.

The file's offset can be greater than the file's current size, in which case the next write to the file will extend the file. This is referred to as creating a hole in a file and is allowed. Any bytes in a file that have not been written are read back as 0.

A hole in a file isn't required to have storage backing it on disk. Depending on the file system implementation, when you write after seeking past the end of the file, new disk blocks might be allocated to store the data, but there is no need to allocate disk blocks for the data between the old end of file and the location where you start writing.

Example

The program shown in Figure 3.2 creates a file with a hole in it.

Running this program gives us

     $ ./a.out
     $ ls -l file.hole                  check its size
     -rw-r--r-- 1 sar          16394 Nov 25 01:01 file.hole
     $ od -c file.hole                  let's look at the actual contents
     0000000   a  b  c  d  e  f  g  h  i  j \0 \0 \0 \0 \0 \0
     0000020  \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
     *
     0040000   A  B  C  D  E  F  G  H  I  J
     0040012

We use the od(1) command to look at the contents of the file. The -c flag tells it to print the contents as characters. We can see that the unwritten bytes in the middle are read back as zero. The seven-digit number at the beginning of each line is the byte offset in octal.

To prove that there is really a hole in the file, let's compare the file we've just created with a file of the same size, but without holes:

     $ ls -ls file.hole file.nohole    compare sizes
       8 -rw-r--r-- 1 sar        16394 Nov 25 01:01 file.hole
      20 -rw-r--r-- 1 sar        16394 Nov 25 01:03 file.nohole

Although both files are the same size, the file without holes consumes 20 disk blocks, whereas the file with holes consumes only 8 blocks.

In this example, we call the write function (Section 3.8). We'll have more to say about files with holes in Section 4.12.

Figure 3.2. Create a file with a hole in it

 #include "apue.h"
 #include <fcntl.h>
 
 char    buf1[] = "abcdefghij";
 char    buf2[] = "ABCDEFGHIJ";
 
 int
 main(void)
 {
     int     fd;
 
     if ((fd = creat("file.hole", FILE_MODE)) < 0)
         err_sys("creat error");
 
     if (write(fd, buf1, 10) != 10)
         err_sys("buf1 write error");
     /* offset now = 10 */
 
     if (lseek(fd, 16384, SEEK_SET) == -1)
         err_sys("lseek error");
     /* offset now = 16384 */
 
     if (write(fd, buf2, 10) != 10)
         err_sys("buf2 write error");
     /* offset now = 16394 */
 
     exit(0);
 }

Because the offset address that lseek uses is represented by an off_t, implementations are allowed to support whatever size is appropriate on their particular platform. Most platforms today provide two sets of interfaces to manipulate file offsets: one set that uses 32-bit file offsets and another set that uses 64-bit file offsets.

The Single UNIX Specification provides a way for applications to determine which environments are supported through the sysconf function (Section 2.5.4.). Figure 3.3 summarizes the sysconf constants that are defined.

Figure 3.3. Data size options and name arguments to sysconf

| Name of option | Description | name argument |
|---|---|---|
| _POSIX_V6_ILP32_OFF32 | int, long, pointer, and off_t types are 32 bits. | _SC_V6_ILP32_OFF32 |
| _POSIX_V6_ILP32_OFFBIG | int, long, and pointer types are 32 bits; off_t types are at least 64 bits. | _SC_V6_ILP32_OFFBIG |
| _POSIX_V6_LP64_OFF64 | int types are 32 bits; long, pointer, and off_t types are 64 bits. | _SC_V6_LP64_OFF64 |
| _POSIX_V6_LPBIG_OFFBIG | int types are 32 bits; long, pointer, and off_t types are at least 64 bits. | _SC_V6_LPBIG_OFFBIG |

The c99 compiler requires that we use the getconf(1) command to map the desired data size model to the flags necessary to compile and link our programs. Different flags and libraries might be needed, depending on the environments supported by each platform.

Unfortunately, this is one area in which implementations haven't caught up to the standards. Confusing things further are the name changes that were made between Version 2 and Version 3 of the Single UNIX Specification.

To get around this, applications can set the _FILE_OFFSET_BITS constant to 64 to enable 64-bit offsets. Doing so changes the definition of off_t to be a 64-bit signed integer. Setting _FILE_OFFSET_BITS to 32 enables 32-bit file offsets. Be aware, however, that although all four platforms discussed in this text support both 32-bit and 64-bit file offsets by setting the _FILE_OFFSET_BITS constant to the desired value, this is not guaranteed to be portable.

Note that even though you might enable 64-bit file offsets, your ability to create a file larger than 2 GB (2³¹-1 bytes) depends on the underlying file system type.

3.7. `read` Function

Data is read from an open file with the read function.

     #include <unistd.h>

     ssize_t read(int filedes, void *buf, size_t nbytes);

Returns: number of bytes read, 0 if end of file, -1 on error

If the read is successful, the number of bytes read is returned. If the end of file is encountered, 0 is returned.

There are several cases in which the number of bytes actually read is less than the amount requested:

When reading from a regular file, if the end of file is reached before the requested number of bytes has been read. For example, if 30 bytes remain until the end of file and we try to read 100 bytes, read returns 30. The next time we call read, it will return 0 (end of file).
When reading from a terminal device. Normally, up to one line is read at a time. (We'll see how to change this in Chapter 18.)
When reading from a network. Buffering within the network may cause less than the requested amount to be returned.
When reading from a pipe or FIFO. If the pipe contains fewer bytes than requested, read will return only what is available.
When reading from a record-oriented device. Some record-oriented devices, such as magnetic tape, can return up to a single record at a time.
When interrupted by a signal and a partial amount of data has already been read. We discuss this further in Section 10.5.

The read operation starts at the file's current offset. Before a successful return, the offset is incremented by the number of bytes actually read.

POSIX.1 changed the prototype for this function in several ways. The classic definition is

     int read(int filedes, char *buf, unsigned nbytes);

First, the second argument was changed from a char * to a void * to be consistent with ISO C: the type void * is used for generic pointers.
Next, the return value must be a signed integer (ssize_t) to return a positive byte count, 0 (for end of file), or -1 (for an error).
Finally, the third argument historically has been an unsigned integer, to allow a 16-bit implementation to read or write up to 65,534 bytes at a time. With the 1990 POSIX.1 standard, the primitive system data type ssize_t was introduced to provide the signed return value, and the unsigned size_t was used for the third argument. (Recall the SSIZE_MAX constant from Section 2.5.2.)

3.8. `write` Function

Data is written to an open file with the write function.

     #include <unistd.h>

     ssize_t write(int filedes, const void *buf, size_t nbytes);

Returns: number of bytes written if OK, -1 on error

The return value is usually equal to the nbytes argument; otherwise, an error has occurred. A common cause for a write error is either filling up a disk or exceeding the file size limit for a given process (Section 7.11 and Exercise 10.11).

For a regular file, the write starts at the file's current offset. If the O_APPEND option was specified when the file was opened, the file's offset is set to the current end of file before each write operation. After a successful write, the file's offset is incremented by the number of bytes actually written.

3.9. I/O Efficiency

The program in Figure 3.4 copies a file, using only the read and write functions. The following caveats apply to this program.

Figure 3.4. Copy standard input to standard output

 #include "apue.h"
 
 #define BUFFSIZE 4096
 
 int
 main(void)
 {
     int    n;
     char   buf[BUFFSIZE];
 
     while ((n = read(STDIN_FILENO, buf, BUFFSIZE)) > 0)
         if (write(STDOUT_FILENO, buf, n) != n)
             err_sys("write error");
 
     if (n < 0)
         err_sys("read error");
 
     exit(0);
 }

It reads from standard input and writes to standard output, assuming that these have been set up by the shell before this program is executed. Indeed, all normal UNIX system shells provide a way to open a file for reading on standard input and to create (or rewrite) a file on standard output. This prevents the program from having to open the input and output files.

Many applications assume that standard input is file descriptor 0 and that standard output is file descriptor 1. In this example, we use the two defined names, STDIN_FILENO and STDOUT_FILENO, from <unistd.h>.
The program doesn't close the input file or output file. Instead, the program uses the feature of the UNIX kernel that closes all open file descriptors in a process when that process terminates.
This example works for both text files and binary files, since there is no difference between the two to the UNIX kernel.

One question we haven't answered, however, is how we chose the BUFFSIZE value. Before answering that, let's run the program using different values for BUFFSIZE. Figure 3.5 shows the results for reading a 103,316,352-byte file, using 20 different buffer sizes.

The file was read using the program shown in Figure 3.4, with standard output redirected to /dev/null. The file system used for this test was the Linux ext2 file system with 4,096-byte blocks. (The st_blksize value, which we describe in Section 4.12, is 4,096.) This accounts for the minimum in the system time occurring at a BUFFSIZE of 4,096. Increasing the buffer size beyond this has little positive effect.

Most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than an application requests, assuming that the application will read it shortly. From the last few entries in Figure 3.5, it appears that read-ahead in ext2 stops having an effect after 128 KB.

Figure 3.5. Timing results for reading with different buffer sizes on Linux

| BUFFSIZE | User CPU (seconds) | System CPU (seconds) | Clock time (seconds) | #loops |
|---|---|---|---|---|
| 1 | 124.89 | 161.65 | 288.64 | 103,316,352 |
| 2 | 63.10 | 80.96 | 145.81 | 51,658,176 |
| 4 | 31.84 | 40.00 | 72.75 | 25,829,088 |
| 8 | 15.17 | 21.01 | 36.85 | 12,914,544 |
| 16 | 7.86 | 10.27 | 18.76 | 6,457,272 |
| 32 | 4.13 | 5.01 | 9.76 | 3,228,636 |
| 64 | 2.11 | 2.48 | 6.76 | 1,614,318 |
| 128 | 1.01 | 1.27 | 6.82 | 807,159 |
| 256 | 0.56 | 0.62 | 6.80 | 403,579 |
| 512 | 0.27 | 0.41 | 7.03 | 201,789 |
| 1,024 | 0.17 | 0.23 | 7.84 | 100,894 |
| 2,048 | 0.05 | 0.19 | 6.82 | 50,447 |
| 4,096 | 0.03 | 0.16 | 6.86 | 25,223 |
| 8,192 | 0.01 | 0.18 | 6.67 | 12,611 |
| 16,384 | 0.02 | 0.18 | 6.87 | 6,305 |
| 32,768 | 0.00 | 0.16 | 6.70 | 3,152 |
| 65,536 | 0.02 | 0.19 | 6.92 | 1,576 |
| 131,072 | 0.00 | 0.16 | 6.84 | 788 |
| 262,144 | 0.01 | 0.25 | 7.30 | 394 |
| 524,288 | 0.00 | 0.22 | 7.35 | 198 |

We'll return to this timing example later in the text. In Section 3.14, we show the effect of synchronous writes; in Section 5.8, we compare these unbuffered I/O times with the standard I/O library.

Beware when trying to measure the performance of programs that read and write files. The operating system will try to cache the file incore, so if you measure the performance of the program repeatedly, the successive timings will likely be better than the first. This is because the first run will cause the file to be entered into the system's cache, and successive runs will access the file from the system's cache instead of from the disk. (The term incore means in main memory. Back in the day, a computer's main memory was built out of ferrite core. This is where the phrase "core dump" comes from: the main memory image of a program stored in a file on disk for diagnosis.)

In the tests reported in Figure 3.5, each run with a different buffer size was made using a different copy of the file so that the current run didn't find the data in the cache from the previous run. The files are large enough that they don't all remain in the cache (the test system was configured with 512 MB of RAM).

3.10. File Sharing

The UNIX System supports the sharing of open files among different processes. Before describing the dup function, we need to describe this sharing. To do this, we'll examine the data structures used by the kernel for all I/O.

The following description is conceptual. It may or may not match a particular implementation. Refer to Bach [1986] for a discussion of these structures in System V. McKusick et al. [1996] describes these structures in 4.4BSD. McKusick and Neville-Neil [2005] cover FreeBSD 5.2. For a similar discussion of Solaris, see Mauro and McDougall [2001].

The kernel uses three data structures to represent an open file, and the relationships among them determine the effect one process has on another with regard to file sharing.

Every process has an entry in the process table. Within each process table entry is a table of open file descriptors, which we can think of as a vector, with one entry per descriptor. Associated with each file descriptor are
1. The file descriptor flags (close-on-exec; refer to Figure 3.6 and Section 3.14)
2. A pointer to a file table entry
The kernel maintains a file table for all open files. Each file table entry contains
1. The file status flags for the file, such as read, write, append, sync, and nonblocking; more on these in Section 3.14
2. The current file offset
3. A pointer to the v-node table entry for the file
Each open file (or device) has a v-node structure that contains information about the type of file and pointers to functions that operate on the file. For most files, the v-node also contains the i-node for the file. This information is read from disk when the file is opened, so that all the pertinent information about the file is readily available. For example, the i-node contains the owner of the file, the size of the file, pointers to where the actual data blocks for the file are located on disk, and so on. (We talk more about i-nodes in Section 4.14 when we describe the typical UNIX file system in more detail.)

Linux has no v-node. Instead, a generic i-node structure is used. Although the implementations differ, the v-node is conceptually the same as a generic i-node. Both point to an i-node structure specific to the file system.

We're ignoring some implementation details that don't affect our discussion. For example, the table of open file descriptors can be stored in the user area instead of the process table. These tables can be implemented in numerous ways; they need not be arrays and could be implemented as linked lists of structures, for example. These implementation details don't affect our discussion of file sharing.

Figure 3.6 shows a pictorial arrangement of these three tables for a single process that has two different files open: one file is open on standard input (file descriptor 0), and the other is open on standard output (file descriptor 1). The arrangement of these three tables has existed since the early versions of the UNIX System [Thompson 1978], and this arrangement is critical to the way files are shared among processes. We'll return to this figure in later chapters, when we describe additional ways that files are shared.

Figure 3.6. Kernel data structures for open files

The v-node was invented to provide support for multiple file system types on a single computer system. This work was done independently by Peter Weinberger (Bell Laboratories) and Bill Joy (Sun Microsystems). Sun called this the Virtual File System and called the file system-independent portion of the i-node the v-node [Kleiman 1986]. The v-node propagated through various vendor implementations as support for Sun's Network File System (NFS) was added. The first release from Berkeley to provide v-nodes was the 4.3BSD Reno release, when NFS was added.

In SVR4, the v-node replaced the file system-independent i-node of SVR3. Solaris is derived from SVR4 and thus uses v-nodes.

Instead of splitting the data structures into a v-node and an i-node, Linux uses a file system-independent i-node and a file system-dependent i-node.

If two independent processes have the same file open, we could have the arrangement shown in Figure 3.7. We assume here that the first process has the file open on descriptor 3 and that the second process has that same file open on descriptor 4. Each process that opens the file gets its own file table entry, but only a single v-node table entry is required for a given file. One reason each process gets its own file table entry is so that each process has its own current offset for the file.

Figure 3.7. Two independent processes with the same file open

Given these data structures, we now need to be more specific about what happens with certain operations that we've already described.

After each write is complete, the current file offset in the file table entry is incremented by the number of bytes written. If this causes the current file offset to exceed the current file size, the current file size in the i-node table entry is set to the current file offset (that is, the file is extended).
If a file is opened with the O_APPEND flag, a corresponding flag is set in the file status flags of the file table entry. Each time a write is performed for a file with this append flag set, the current file offset in the file table entry is first set to the current file size from the i-node table entry. This forces every write to be appended to the current end of file.
If a file is positioned to its current end of file using lseek, all that happens is the current file offset in the file table entry is set to the current file size from the i-node table entry. (Note that this is not the same as if the file was opened with the O_APPEND flag, as we will see in Section 3.11.)
The lseek function modifies only the current file offset in the file table entry. No I/O takes place.

It is possible for more than one file descriptor entry to point to the same file table entry, as we'll see when we discuss the dup function in Section 3.12. This also happens after a fork when the parent and the child share the same file table entry for each open descriptor (Section 8.3).

Note the difference in scope between the file descriptor flags and the file status flags. The former apply only to a single descriptor in a single process, whereas the latter apply to all descriptors in any process that point to the given file table entry. When we describe the fcntl function in Section 3.14, we'll see how to fetch and modify both the file descriptor flags and the file status flags.

Everything that we've described so far in this section works fine for multiple processes that are reading the same file. Each process has its own file table entry with its own current file offset. Unexpected results can arise, however, when multiple processes write to the same file. To see how to avoid some surprises, we need to understand the concept of atomic operations.

3.11. Atomic Operations

Appending to a File

Consider a single process that wants to append to the end of a file. Older versions of the UNIX System didn't support the O_APPEND option to open, so the program was coded as follows:

      if (lseek(fd, 0L, 2) < 0)                /* position to EOF */
         err_sys("lseek error");
      if (write(fd, buf, 100) != 100)          /* and write */
         err_sys("write error");

This works fine for a single process, but problems arise if multiple processes use this technique to append to the same file. (This scenario can arise if multiple instances of the same program are appending messages to a log file, for example.)

Assume that two independent processes, A and B, are appending to the same file. Each has opened the file but without the O_APPEND flag. This gives us the same picture as Figure 3.7. Each process has its own file table entry, but they share a single v-node table entry. Assume that process A does the lseek and that this sets the current offset for the file for process A to byte offset 1,500 (the current end of file). Then the kernel switches processes, and B continues running. Process B then does the lseek, which sets the current offset for the file for process B to byte offset 1,500 also (the current end of file). Then B calls write, which increments B's current file offset for the file to 1,600. Because the file's size has been extended, the kernel also updates the current file size in the v-node to 1,600. Then the kernel switches processes and A resumes. When A calls write, the data is written starting at the current file offset for A, which is byte offset 1,500. This overwrites the data that B wrote to the file.

The problem here is that our logical operation of "position to the end of file and write" requires two separate function calls (as we've shown it). The solution is to have the positioning to the current end of file and the write be an atomic operation with regard to other processes. Any operation that requires more than one function call cannot be atomic, as there is always the possibility that the kernel can temporarily suspend the process between the two function calls (as we assumed previously).

The UNIX System provides an atomic way to do this operation if we set the O_APPEND flag when a file is opened. As we described in the previous section, this causes the kernel to position the file to its current end of file before each write. We no longer have to call lseek before each write.
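
As a minimal sketch of the O_APPEND approach just described, the helper below opens a log file in append mode and writes one record. The name append_record and the 0644 creation mode are assumptions made for illustration; they do not come from the text.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Append len bytes from buf to the file at path.
 * O_APPEND makes the kernel position to end of file as part of each
 * write, so there is no separate lseek that another process could
 * slip in behind.
 * Returns 0 on success, -1 on error (errno is set by the failing call).
 * append_record is a hypothetical helper name.
 */
int
append_record(const char *path, const void *buf, size_t len)
{
    int     fd;
    ssize_t n;

    if ((fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644)) < 0)
        return -1;
    n = write(fd, buf, len);        /* lands at the current end of file */
    if (close(fd) < 0)
        return -1;
    return (n == (ssize_t)len) ? 0 : -1;
}
```

Two processes calling append_record on the same file each add their bytes after whatever is already there, which is the behavior the lseek-then-write version could not guarantee.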

`pread` and `pwrite` Functions

The Single UNIX Specification includes XSI extensions that allow applications to seek and perform I/O atomically. These extensions are pread and pwrite.

#include <unistd.h>

ssize_t pread(int filedes, void *buf, size_t nbytes, off_t offset);

Returns: number of bytes read, 0 if end of file, -1 on error

ssize_t pwrite(int filedes, const void *buf, size_t nbytes, off_t offset);

Returns: number of bytes written if OK, -1 on error

Calling pread is equivalent to calling lseek followed by a call to read, with the following exceptions.

There is no way to interrupt the two operations using pread.
The file pointer is not updated.

Calling pwrite is equivalent to calling lseek followed by a call to write, with similar exceptions.

Creating a File

We saw another example of an atomic operation when we described the O_CREAT and O_EXCL options for the open function. When both of these options are specified, the open will fail if the file already exists. We also said that the check for the existence of the file and the creation of the file was performed as an atomic operation. If we didn't have this atomic operation, we might try

     if ((fd = open(pathname, O_WRONLY)) < 0) {
         if (errno == ENOENT) {
             if ((fd = creat(pathname, mode)) < 0)
                  err_sys("creat error");
         } else {
             err_sys("open error");
         }
     }

The problem occurs if the file is created by another process between the open and the creat. If the file is created by another process between these two function calls, and if that other process writes something to the file, that data is erased when this creat is executed. Combining the test for existence and the creation into a single atomic operation avoids this problem.

In general, the term atomic operation refers to an operation that might be composed of multiple steps. If the operation is performed atomically, either all the steps are performed, or none are performed. It must not be possible for a subset of the steps to be performed. We'll return to the topic of atomic operations when we describe the link function (Section 4.15) and record locking (Section 14.3).

3.12. `dup` and `dup2` Functions

An existing file descriptor is duplicated by either of the following functions.

#include <unistd.h>

int dup(int filedes);
int dup2(int filedes, int filedes2);

Both return: new file descriptor if OK, -1 on error

The new file descriptor returned by dup is guaranteed to be the lowest-numbered available file descriptor. With dup2, we specify the value of the new descriptor with the filedes2 argument. If filedes2 is already open, it is first closed. If filedes equals filedes2, then dup2 returns filedes2 without closing it.

The new file descriptor that is returned as the value of the functions shares the same file table entry as the filedes argument. We show this in Figure 3.8.

Figure 3.8. Kernel data structures after `dup`(1)


In this figure, we're assuming that when it's started, the process executes

     newfd = dup(1);

We assume that the next available descriptor is 3 (which it probably is, since 0, 1, and 2 are opened by the shell). Because both descriptors point to the same file table entry, they share the same file status flags (read, write, append, and so on) and the same current file offset.

Each descriptor has its own set of file descriptor flags. As we describe in Section 3.14, the close-on-exec file descriptor flag for the new descriptor is always cleared by the dup functions.

Another way to duplicate a descriptor is with the fcntl function, which we describe in Section 3.14. Indeed, the call

     dup(filedes);

is equivalent to

     fcntl(filedes, F_DUPFD, 0);

Similarly, the call

     dup2(filedes, filedes2);

is equivalent to

     close(filedes2);
     fcntl(filedes, F_DUPFD, filedes2);

In this last case, the dup2 is not exactly the same as a close followed by an fcntl. The differences are as follows.

dup2 is an atomic operation, whereas the alternate form involves two function calls. It is possible in the latter case to have a signal catcher called between the close and the fcntl that could modify the file descriptors. (We describe signals in Chapter 10.)
There are some errno differences between dup2 and fcntl.

The dup2 system call originated with Version 7 and propagated through the BSD releases. The fcntl method for duplicating file descriptors appeared with System III and continued with System V. SVR3.2 picked up the dup2 function, and 4.2BSD picked up the fcntl function and the F_DUPFD functionality. POSIX.1 requires both dup2 and the F_DUPFD feature of fcntl.

3.13. `sync`, `fsync`, and `fdatasync` Functions

Traditional implementations of the UNIX System have a buffer cache or page cache in the kernel through which most disk I/O passes. When we write data to a file, the data is normally copied by the kernel into one of its buffers and queued for writing to disk at some later time. This is called delayed write. (Chapter 3 of Bach [1986] discusses this buffer cache in detail.)

The kernel eventually writes all the delayed-write blocks to disk, normally when it needs to reuse the buffer for some other disk block. To ensure consistency of the file system on disk with the contents of the buffer cache, the sync, fsync, and fdatasync functions are provided.

#include <unistd.h>

int fsync(int filedes);
int fdatasync(int filedes);

Returns: 0 if OK, -1 on error

void sync(void);

The sync function simply queues all the modified block buffers for writing and returns; it does not wait for the disk writes to take place.

The function sync is normally called periodically (usually every 30 seconds) from a system daemon, often called update. This guarantees regular flushing of the kernel's block buffers. The command sync(1) also calls the sync function.

The function fsync refers only to a single file, specified by the file descriptor filedes, and waits for the disk writes to complete before returning. The intended use of fsync is for an application, such as a database, that needs to be sure that the modified blocks have been written to the disk.

The fdatasync function is similar to fsync, but it affects only the data portions of a file. With fsync, the file's attributes are also updated synchronously.

All four of the platforms described in this book support sync and fsync. However, FreeBSD 5.2.1 and Mac OS X 10.3 do not support fdatasync.

3.14. `fcntl` Function

The fcntl function can change the properties of a file that is already open.

#include <fcntl.h>

int fcntl(int filedes, int cmd, ... /* int arg */ );

Returns: depends on cmd if OK (see following), -1 on error

In the examples in this section, the third argument is always an integer, corresponding to the comment in the function prototype just shown. But when we describe record locking in Section 14.3, the third argument becomes a pointer to a structure.

The fcntl function is used for five different purposes.

Duplicate an existing descriptor (cmd = F_DUPFD)
Get/set file descriptor flags (cmd = F_GETFD or F_SETFD)
Get/set file status flags (cmd = F_GETFL or F_SETFL)
Get/set asynchronous I/O ownership (cmd = F_GETOWN or F_SETOWN)
Get/set record locks (cmd = F_GETLK, F_SETLK, or F_SETLKW)

We'll now describe the first seven of these ten cmd values. (We'll wait until Section 14.3 to describe the last three, which deal with record locking.) Refer to Figure 3.6, since we'll be referring to both the file descriptor flags associated with each file descriptor in the process table entry and the file status flags associated with each file table entry.

F_DUPFD
Duplicate the file descriptor filedes. The new file descriptor is returned as the value of the function. It is the lowest-numbered descriptor that is not already open, that is greater than or equal to the third argument (taken as an integer). The new descriptor shares the same file table entry as filedes. (Refer to Figure 3.8.) But the new descriptor has its own set of file descriptor flags, and its FD_CLOEXEC file descriptor flag is cleared. (This means that the descriptor is left open across an exec, which we discuss in Chapter 8.)
F_GETFD
Return the file descriptor flags for filedes as the value of the function. Currently, only one file descriptor flag is defined: the FD_CLOEXEC flag.
F_SETFD
Set the file descriptor flags for filedes. The new flag value is set from the third argument (taken as an integer).

Be aware that some existing programs that deal with the file descriptor flags don't use the constant FD_CLOEXEC. Instead, the programs set the flag to either 0 (don't close-on-exec, the default) or 1 (do close-on-exec).

F_GETFL
Return the file status flags for filedes as the value of the function. We described the file status flags when we described the open function. They are listed in Figure 3.9.

Figure 3.9. File status flags for fcntl
File status flag   Description
O_RDONLY           open for reading only
O_WRONLY           open for writing only
O_RDWR             open for reading and writing
O_APPEND           append on each write
O_NONBLOCK         nonblocking mode
O_SYNC             wait for writes to complete (data and attributes)
O_DSYNC            wait for writes to complete (data only)
O_RSYNC            synchronize reads and writes
O_FSYNC            wait for writes to complete (FreeBSD and Mac OS X only)
O_ASYNC            asynchronous I/O (FreeBSD and Mac OS X only)

Unfortunately, the three access-mode flags (O_RDONLY, O_WRONLY, and O_RDWR) are not separate bits that can be tested. (As we mentioned earlier, these three often have the values 0, 1, and 2, respectively, for historical reasons. Also, these three values are mutually exclusive; a file can have only one of the three enabled.) Therefore, we must first use the O_ACCMODE mask to obtain the access-mode bits and then compare the result against any of the three values.
F_SETFL
Set the file status flags to the value of the third argument (taken as an integer). The only flags that can be changed are O_APPEND, O_NONBLOCK, O_SYNC, O_DSYNC, O_RSYNC, O_FSYNC, and O_ASYNC.
F_GETOWN
Get the process ID or process group ID currently receiving the SIGIO and SIGURG signals. We describe these asynchronous I/O signals in Section 14.6.2.
F_SETOWN
Set the process ID or process group ID to receive the SIGIO and SIGURG signals. A positive arg specifies a process ID. A negative arg implies a process group ID equal to the absolute value of arg.

The return value from fcntl depends on the command. All commands return -1 on an error or some other value if OK. The following four commands have special return values: F_DUPFD, F_GETFD, F_GETFL, and F_GETOWN. The first returns the new file descriptor, the next two return the corresponding flags, and the final one returns a positive process ID or a negative process group ID.
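
The F_DUPFD, F_GETFD, and F_SETFD commands can be sketched together. The helper names set_cloexec and dup_at_least are hypothetical; the fcntl commands and the FD_CLOEXEC constant are the ones described above.

```c
#include <fcntl.h>

/*
 * Turn on close-on-exec for fd while preserving any other file
 * descriptor flags. Returns 0 on success, -1 on error.
 * set_cloexec is a hypothetical helper name.
 */
int
set_cloexec(int fd)
{
    int flags;

    if ((flags = fcntl(fd, F_GETFD)) < 0)
        return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0)
        return -1;
    return 0;
}

/*
 * Duplicate fd onto the lowest free descriptor that is >= min_fd.
 * The new descriptor has its own descriptor flags, with FD_CLOEXEC
 * cleared, even if fd had it set.
 * Returns the new descriptor, or -1 on error.
 * dup_at_least is a hypothetical helper name.
 */
int
dup_at_least(int fd, int min_fd)
{
    return fcntl(fd, F_DUPFD, min_fd);
}
```

Setting close-on-exec on a descriptor and then duplicating it shows the difference in scope: the original keeps FD_CLOEXEC, while the duplicate starts with the flag off.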

Example

The program in Figure 3.10 takes a single command-line argument that specifies a file descriptor and prints a description of selected file flags for that descriptor.

Note that we use the feature test macro _POSIX_C_SOURCE and conditionally compile the file access flags that are not part of POSIX.1. The following script shows the operation of the program, when invoked from bash (the Bourne-again shell). Results vary, depending on which shell you use.

      $ ./a.out 0 < /dev/tty
      read only
      $ ./a.out 1 > temp.foo
      $ cat temp.foo
      write only
      $ ./a.out 2 2>>temp.foo
      write only, append
      $ ./a.out 5 5<>temp.foo
      read write

The clause 5<>temp.foo opens the file temp.foo for reading and writing on file descriptor 5.

Figure 3.10. Print file flags for specified descriptor

 #include "apue.h"
 #include <fcntl.h>
 int
 main(int argc, char *argv[])
 {
 
     int       val;
 
     if (argc != 2)
         err_quit("usage: a.out <descriptor#>");
 
     if ((val = fcntl(atoi(argv[1]), F_GETFL, 0)) < 0)
         err_sys("fcntl error for fd %d", atoi(argv[1]));
 
     switch (val & O_ACCMODE) {
     case O_RDONLY:
         printf("read only");
         break;
 
     case O_WRONLY:
         printf("write only");
         break;
 
     case O_RDWR:
         printf("read write");
         break;
 
     default:
         err_dump("unknown access mode");
     }
 
     if (val & O_APPEND)
         printf(", append");
     if (val & O_NONBLOCK)
         printf(", nonblocking");
 #if defined(O_SYNC)
     if (val & O_SYNC)
         printf(", synchronous writes");
 #endif
 #if !defined(_POSIX_C_SOURCE) && defined(O_FSYNC)
     if (val & O_FSYNC)
         printf(", synchronous writes");
 #endif
     putchar('\n');
     exit(0);
 }

Example

When we modify either the file descriptor flags or the file status flags, we must be careful to fetch the existing flag value, modify it as desired, and then set the new flag value. We can't simply do an F_SETFD or an F_SETFL, as this could turn off flag bits that were previously set.

Figure 3.11 shows a function that sets one or more of the file status flags for a descriptor.

If we change the middle statement to

    val &= ~flags;          /* turn flags off */

we have a function named clr_fl, which we'll use in some later examples. This statement logically ANDs the one's complement of flags with the current val.

If we call set_fl from the program in Figure 3.4 by adding the line

     set_fl(STDOUT_FILENO, O_SYNC);

at the beginning of the program, we'll turn on the synchronous-write flag. This causes each write to wait for the data to be written to disk before returning. Normally in the UNIX System, a write only queues the data for writing; the actual disk write operation can take place sometime later. A database system is a likely candidate for using O_SYNC, so that it knows on return from a write that the data is actually on the disk, in case of an abnormal system failure.

We expect the O_SYNC flag to increase the clock time when the program runs. To test this, we can run the program in Figure 3.4, copying 98.5 MB of data from one file on disk to another and compare this with a version that does the same thing with the O_SYNC flag set. The results from a Linux system using the ext2 file system are shown in Figure 3.12.

The six rows in Figure 3.12 were all measured with a BUFFSIZE of 4,096. The results in Figure 3.5 were measured reading a disk file and writing to /dev/null, so there was no disk output. The second row in Figure 3.12 corresponds to reading a disk file and writing to another disk file. This is why the first and second rows in Figure 3.12 are different. The system time increases when we write to a disk file, because the kernel now copies the data from our process and queues the data for writing by the disk driver. We expect the clock time to increase also when we write to a disk file, but it doesn't increase significantly for this test, which indicates that our writes go to the system cache, and we don't measure the cost to actually write the data to disk.

When we enable synchronous writes, the system time and the clock time should increase significantly. As the third row shows, the time for writing synchronously is about the same as when we used delayed writes. This implies that the Linux ext2 file system isn't honoring the O_SYNC flag. This suspicion is supported by the sixth row, which shows that the time to do synchronous writes followed by a call to fsync is just as large as calling fsync after writing the file without synchronous writes (row 5). After writing a file synchronously, we expect that a call to fsync will have no effect.

Figure 3.13 shows timing results for the same tests on Mac OS X 10.3. Note that the times match our expectations: synchronous writes are far more expensive than delayed writes, and using fsync with synchronous writes makes no measurable difference. Note also that adding a call to fsync at the end of the delayed writes makes no measurable difference. It is likely that the operating system flushed previously written data to disk as we were writing new data to the file, so by the time that we called fsync, very little work was left to be done.

Compare fsync and fdatasync, which update a file's contents when we say so, with the O_SYNC flag, which updates a file's contents every time we write to the file.

Figure 3.11. Turn on one or more of the file status flags for a descriptor

 #include "apue.h"
 #include <fcntl.h>
 
 void
 set_fl(int fd, int flags) /* flags are file status flags to turn on */
 {
     int     val;
 
     if ((val = fcntl(fd, F_GETFL, 0)) < 0)
         err_sys("fcntl F_GETFL error");
 
     val |= flags;       /* turn on flags */
 
     if (fcntl(fd, F_SETFL, val) < 0)
         err_sys("fcntl F_SETFL error");
 }
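
The clr_fl variant mentioned earlier differs from set_fl only in the middle statement. The sketch below is a standalone version that does not depend on apue.h; returning -1 instead of calling err_sys is an assumption made here so that the block compiles on its own.

```c
#include <fcntl.h>

/*
 * Turn off one or more file status flags for a descriptor.
 * Same fetch, modify, set sequence as set_fl, but the flags are
 * cleared by ANDing with their one's complement.
 * Returns 0 on success, -1 on error (this sketch reports errors to
 * the caller rather than calling err_sys).
 */
int
clr_fl(int fd, int flags)
{
    int val;

    if ((val = fcntl(fd, F_GETFL, 0)) < 0)
        return -1;

    val &= ~flags;          /* turn flags off */

    if (fcntl(fd, F_SETFL, val) < 0)
        return -1;
    return 0;
}
```

For example, clr_fl(fd, O_APPEND) on a descriptor opened with O_APPEND leaves its other status flags as they were and removes only the append behavior.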

Figure 3.12. Linux ext2 timing results using various synchronization mechanisms
Operation                                         User CPU    System CPU   Clock time
                                                  (seconds)   (seconds)    (seconds)
read time from Figure 3.5 for BUFFSIZE = 4,096    0.03        0.16          6.86
normal write to disk file                         0.02        0.30          6.87
write to disk file with O_SYNC set                0.03        0.30          6.83
write to disk followed by fdatasync               0.03        0.42         18.28
write to disk followed by fsync                   0.03        0.37         17.95
write to disk with O_SYNC set followed by fsync   0.05        0.44         17.95

Figure 3.13. Mac OS X timing results using various synchronization mechanisms
Operation                                          User CPU    System CPU   Clock time
                                                   (seconds)   (seconds)    (seconds)
write to /dev/null                                 0.06        0.79          4.33
normal write to disk file                          0.05        3.56         14.40
write to disk file with O_FSYNC set                0.13        9.53         22.48
write to disk followed by fsync                    0.11        3.31         14.12
write to disk with O_FSYNC set followed by fsync   0.17        9.14         22.12

With this example, we see the need for fcntl. Our program operates on a descriptor (standard output), never knowing the name of the file that was opened by the shell on that descriptor. We can't set the O_SYNC flag when the file is opened, since the shell opened the file. With fcntl, we can modify the properties of a descriptor, knowing only the descriptor for the open file. We'll see another need for fcntl when we describe nonblocking pipes (Section 15.2), since all we have with a pipe is a descriptor.

3.15. `ioctl` Function

The ioctl function has always been the catchall for I/O operations. Anything that couldn't be expressed using one of the other functions in this chapter usually ended up being specified with an ioctl. Terminal I/O was the biggest user of this function. (When we get to Chapter 18, we'll see that POSIX.1 has replaced the terminal I/O operations with separate functions.)

#include <unistd.h>        /* System V */
#include <sys/ioctl.h>     /* BSD and Linux */
#include <stropts.h>       /* XSI STREAMS */

int ioctl(int filedes, int request, ...);

Returns: -1 on error, something else if OK

The ioctl function is included in the Single UNIX Specification only as an extension for dealing with STREAMS devices [Rago 1993]. UNIX System implementations, however, use it for many miscellaneous device operations. Some implementations have even extended it for use with regular files.

The prototype that we show corresponds to POSIX.1. FreeBSD 5.2.1 and Mac OS X 10.3 declare the second argument as an unsigned long. This detail doesn't matter, since the second argument is always a #defined name from a header.

For the ISO C prototype, an ellipsis is used for the remaining arguments. Normally, however, there is only one more argument, and it's usually a pointer to a variable or a structure.

In this prototype, we show only the headers required for the function itself. Normally, additional device-specific headers are required. For example, the ioctl commands for terminal I/O, beyond the basic operations specified by POSIX.1, all require the <termios.h> header.
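
To make the "pointer to a variable" form concrete, here is a sketch using FIONREAD, one of the file I/O requests, to ask how many bytes are waiting on a descriptor. FIONREAD is widely available on BSD-derived systems and Linux but is not specified by POSIX.1, so its presence is an assumption; the helper name bytes_readable is hypothetical.

```c
#include <sys/ioctl.h>

/*
 * Ask the driver how many bytes are ready to be read on fd.
 * The third ioctl argument is a pointer to an int that the kernel
 * fills in.
 * Returns the count, or -1 on error.
 * Assumes FIONREAD is provided via <sys/ioctl.h> on this platform.
 * bytes_readable is a hypothetical helper name.
 */
int
bytes_readable(int fd)
{
    int n;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

On a pipe, for instance, bytes_readable on the read end reports the number of bytes written to the other end that have not yet been read.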

Each device driver can define its own set of ioctl commands. The system, however, provides generic ioctl commands for different classes of devices. Examples of some of the categories for these generic ioctl commands supported in FreeBSD are summarized in Figure 3.14.

Figure 3.14. Common FreeBSD ioctl operations
Category       Constant names   Header              Number of ioctls
disk labels    DIOxxx           <sys/disklabel.h>     6
file I/O       FIOxxx           <sys/filio.h>         9
mag tape I/O   MTIOxxx          <sys/mtio.h>         11
socket I/O     SIOxxx           <sys/sockio.h>       60
terminal I/O   TIOxxx           <sys/ttycom.h>       44

The mag tape operations allow us to write end-of-file marks on a tape, rewind a tape, space forward over a specified number of files or records, and the like. None of these operations is easily expressed in terms of the other functions in the chapter (read, write, lseek, and so on), so the easiest way to handle these devices has always been to access their operations using ioctl.

We use the ioctl function in Section 14.4 when we describe the STREAMS system, in Section 18.12 to fetch and set the size of a terminal's window, and in Section 19.7 when we access the advanced features of pseudo terminals.

3.16. `/dev/fd`

Newer systems provide a directory named /dev/fd whose entries are files named 0, 1, 2, and so on. Opening the file /dev/fd/n is equivalent to duplicating descriptor n, assuming that descriptor n is open.

The /dev/fd feature was developed by Tom Duff and appeared in the 8th Edition of the Research UNIX System. It is supported by all of the systems described in this book: FreeBSD 5.2.1, Linux 2.4.22, Mac OS X 10.3, and Solaris 9. It is not part of POSIX.1.

In the function call

     fd = open("/dev/fd/0", mode);

most systems ignore the specified mode, whereas others require that it be a subset of the mode used when the referenced file (standard input, in this case) was originally opened. Because the previous open is equivalent to

     fd = dup(0);

the descriptors 0 and fd share the same file table entry (Figure 3.8). For example, if descriptor 0 was opened read-only, we can only read on fd. Even if the system ignores the open mode, and the call

     fd = open("/dev/fd/0", O_RDWR);

succeeds, we still can't write to fd.

We can also call creat with a /dev/fd pathname argument, as well as specifying O_CREAT in a call to open. This allows a program that calls creat to still work if the pathname argument is /dev/fd/1, for example.

Some systems provide the pathnames /dev/stdin, /dev/stdout, and /dev/stderr. These pathnames are equivalent to /dev/fd/0, /dev/fd/1, and /dev/fd/2.

The main use of the /dev/fd files is from the shell. It allows programs that use pathname arguments to handle standard input and standard output in the same manner as other pathnames. For example, the cat(1) program specifically looks for an input filename of - and uses this to mean standard input. The command

     filter file2 | cat file1 - file3 | lpr

is an example. First, cat reads file1, next its standard input (the output of the filter program on file2), then file3. If /dev/fd is supported, the special handling of - can be removed from cat, and we can enter

     filter file2 | cat file1 /dev/fd/0 file3 | lpr

The special meaning of - as a command-line argument to refer to the standard input or standard output is a kludge that has crept into many programs. There are also problems if we specify - as the first file, as it looks like the start of another command-line option. Using /dev/fd is a step toward uniformity and cleanliness.

3.17. Summary

This chapter has described the basic I/O functions provided by the UNIX System. These are often called the unbuffered I/O functions because each read or write invokes a system call into the kernel. Using only read and write, we looked at the effect of various I/O sizes on the amount of time required to read a file. We also looked at several ways to flush written data to disk and their effect on application performance.

Atomic operations were introduced when multiple processes append to the same file and when multiple processes create the same file. We also looked at the data structures used by the kernel to share information about open files. We'll return to these data structures later in the text.

We also described the ioctl and fcntl functions. We return to both of these functions in Chapter 14, where we'll use ioctl with the STREAMS I/O system, and fcntl for record locking.

4.1. Introduction

In the previous chapter we covered the basic functions that perform I/O. The discussion centered around I/O for regular files: opening a file, and reading or writing a file. We'll now look at additional features of the file system and the properties of a file. We'll start with the stat functions and go through each member of the stat structure, looking at all the attributes of a file. In this process, we'll also describe each of the functions that modify these attributes: change the owner, change the permissions, and so on. We'll also look in more detail at the structure of a UNIX file system and symbolic links. We finish this chapter with the functions that operate on directories, and we develop a function that descends through a directory hierarchy.

4.2. `stat`, `fstat`, and `lstat` Functions

The discussion in this chapter centers around the three stat functions and the information they return.

#include <sys/stat.h>

int stat(const char *restrict pathname, struct stat *restrict buf);
int fstat(int filedes, struct stat *buf);
int lstat(const char *restrict pathname, struct stat *restrict buf);

All three return: 0 if OK, -1 on error

Given a pathname, the stat function returns a structure of information about the named file. The fstat function obtains information about the file that is already open on the descriptor filedes. The lstat function is similar to stat, but when the named file is a symbolic link, lstat returns information about the symbolic link, not the file referenced by the symbolic link. (We'll need lstat in Section 4.21 when we walk down a directory hierarchy. We describe symbolic links in more detail in Section 4.16.)

The second argument is a pointer to a structure that we must supply. The function fills in the structure pointed to by buf. The definition of the structure can differ among implementations, but it could look like

      struct stat {
        mode_t    st_mode;      /* file type & mode (permissions) */
        ino_t     st_ino;       /* i-node number (serial number) */
        dev_t     st_dev;       /* device number (file system) */
        dev_t     st_rdev;      /* device number for special files */
        nlink_t   st_nlink;     /* number of links */
        uid_t     st_uid;       /* user ID of owner */
        gid_t     st_gid;       /* group ID of owner */
        off_t     st_size;      /* size in bytes, for regular files */
        time_t    st_atime;     /* time of last access */
        time_t    st_mtime;     /* time of last modification */
        time_t    st_ctime;     /* time of last file status change */
        blksize_t st_blksize;   /* best I/O block size */
        blkcnt_t  st_blocks;    /* number of disk blocks allocated */
      };

The st_rdev, st_blksize, and st_blocks fields are not required by POSIX.1. They are defined as XSI extensions in the Single UNIX Specification.

Note that each member is specified by a primitive system data type (see Section 2.8). We'll go through each member of this structure to examine the attributes of a file.

The biggest user of the stat functions is probably the ls -l command, to learn all the information about a file.

4.3. File Types

We've talked about two different types of files so far: regular files and directories. Most files on a UNIX system are either regular files or directories, but there are additional types of files. The types are:

Regular file. The most common type of file, which contains data of some form. There is no distinction to the UNIX kernel whether this data is text or binary. Any interpretation of the contents of a regular file is left to the application processing the file.

One notable exception to this is with binary executable files. To execute a program, the kernel must understand its format. All binary executable files conform to a format that allows the kernel to identify where to load a program's text and data.
Directory file. A file that contains the names of other files and pointers to information on these files. Any process that has read permission for a directory file can read the contents of the directory, but only the kernel can write directly to a directory file. Processes must use the functions described in this chapter to make changes to a directory.
Block special file. A type of file providing buffered I/O access in fixed-size units to devices such as disk drives.
Character special file. A type of file providing unbuffered I/O access in variable-sized units to devices. All devices on a system are either block special files or character special files.
FIFO. A type of file used for communication between processes. It's sometimes called a named pipe. We describe FIFOs in Section 15.5.
Socket. A type of file used for network communication between processes. A socket can also be used for non-network communication between processes on a single host. We use sockets for interprocess communication in Chapter 16.
Symbolic link. A type of file that points to another file. We talk more about symbolic links in Section 4.16.

The type of a file is encoded in the st_mode member of the stat structure. We can determine the file type with the macros shown in Figure 4.1. The argument to each of these macros is the st_mode member from the stat structure.

Figure 4.1. File type macros in <sys/stat.h>
Macro        Type of file
S_ISREG()    regular file
S_ISDIR()    directory file
S_ISCHR()    character special file
S_ISBLK()    block special file
S_ISFIFO()   pipe or FIFO
S_ISLNK()    symbolic link
S_ISSOCK()   socket

POSIX.1 allows implementations to represent interprocess communication (IPC) objects, such as message queues and semaphores, as files. The macros shown in Figure 4.2 allow us to determine the type of IPC object from the stat structure. Instead of taking the st_mode member as an argument, these macros differ from those in Figure 4.1 in that their argument is a pointer to the stat structure.

Figure 4.2. IPC type macros in <sys/stat.h>
     Macro            Type of object
     S_TYPEISMQ()     message queue
     S_TYPEISSEM()    semaphore
     S_TYPEISSHM()    shared memory object

Message queues, semaphores, and shared memory objects are discussed in Chapter 15. However, none of the various implementations of the UNIX System discussed in this book represent these objects as files.

Example

The program in Figure 4.3 prints the type of file for each command-line argument.

Sample output from Figure 4.3 is

     $ ./a.out /etc/passwd /etc /dev/initctl /dev/log /dev/tty \
     > /dev/scsi/host0/bus0/target0/lun0/cd /dev/cdrom
     /etc/passwd: regular
     /etc: directory
     /dev/initctl: fifo
     /dev/log: socket
     /dev/tty: character special
     /dev/scsi/host0/bus0/target0/lun0/cd: block special
     /dev/cdrom: symbolic link

(Here, we have explicitly entered a backslash at the end of the first command line, telling the shell that we want to continue entering the command on another line. The shell then prompts us with its secondary prompt, >, on the next line.) We have specifically used the lstat function instead of the stat function to detect symbolic links. If we used the stat function, we would never see symbolic links.

To compile this program on a Linux system, we must define _GNU_SOURCE to include the definition of the S_ISSOCK macro.

Figure 4.3. Print type of file for each command-line argument

 #include "apue.h"
 
 int
 main(int argc, char *argv[])
 {
     int         i;
     struct stat buf;
     char        *ptr;

     for (i = 1; i < argc; i++) {
         printf("%s: ", argv[i]);
         if (lstat(argv[i], &buf) < 0) {
             err_ret("lstat error");
             continue;
         }
         if (S_ISREG(buf.st_mode))
             ptr = "regular";
         else if (S_ISDIR(buf.st_mode))
             ptr = "directory";
         else if (S_ISCHR(buf.st_mode))
             ptr = "character special";
         else if (S_ISBLK(buf.st_mode))
             ptr = "block special";
         else if (S_ISFIFO(buf.st_mode))
             ptr = "fifo";
         else if (S_ISLNK(buf.st_mode))
             ptr = "symbolic link";
         else if (S_ISSOCK(buf.st_mode))
             ptr = "socket";
         else
             ptr = "** unknown mode **";
         printf("%s\n", ptr);
     }
     exit(0);
 }

Historically, early versions of the UNIX System didn't provide the S_ISxxx macros. Instead, we had to logically AND the st_mode value with the mask S_IFMT and then compare the result with the constants whose names are S_IFxxx. Most systems define this mask and the related constants in the file <sys/stat.h>. If we examine this file, we'll find the S_ISDIR macro defined something like

     #define S_ISDIR(mode) (((mode) & S_IFMT) == S_IFDIR)

We've said that regular files are predominant, but it is interesting to see what percentage of the files on a given system are of each file type. Figure 4.4 shows the counts and percentages for a Linux system that is used as a single-user workstation. This data was obtained from the program that we show in Section 4.21.

Figure 4.4. Counts and percentages of different file types
     File type            Count      Percentage (%)
     regular file         226,856    88.22
     directory            23,017     8.95
     symbolic link        6,442      2.51
     character special    447        0.17
     block special        312        0.12
     socket               69         0.03
     FIFO                 1          0.00

4.4. Set-User-ID and Set-Group-ID

Every process has six or more IDs associated with it. These are shown in Figure 4.5.

Figure 4.5. User IDs and group IDs associated with each process
     real user ID                who we really are
     real group ID

     effective user ID           used for file access permission checks
     effective group ID
     supplementary group IDs

     saved set-user-ID           saved by exec functions
     saved set-group-ID

The real user ID and real group ID identify who we really are. These two fields are taken from our entry in the password file when we log in. Normally, these values don't change during a login session, although there are ways for a superuser process to change them, which we describe in Section 8.11.
The effective user ID, effective group ID, and supplementary group IDs determine our file access permissions, as we describe in the next section. (We defined supplementary group IDs in Section 1.8.)
The saved set-user-ID and saved set-group-ID contain copies of the effective user ID and the effective group ID when a program is executed. We describe the function of these two saved values when we describe the setuid function in Section 8.11.

The saved IDs are required with the 2001 version of POSIX.1. They used to be optional in older versions of POSIX. An application can test for the constant _POSIX_SAVED_IDS at compile time or can call sysconf with the _SC_SAVED_IDS argument at runtime, to see whether the implementation supports this feature.

Normally, the effective user ID equals the real user ID, and the effective group ID equals the real group ID.

Every file has an owner and a group owner. The owner is specified by the st_uid member of the stat structure; the group owner, by the st_gid member.

When we execute a program file, the effective user ID of the process is usually the real user ID, and the effective group ID is usually the real group ID. But the capability exists to set a special flag in the file's mode word (st_mode) that says "when this file is executed, set the effective user ID of the process to be the owner of the file (st_uid)." Similarly, another bit can be set in the file's mode word that causes the effective group ID to be the group owner of the file (st_gid). These two bits in the file's mode word are called the set-user-ID bit and the set-group-ID bit.

For example, if the owner of the file is the superuser and if the file's set-user-ID bit is set, then while that program file is running as a process, it has superuser privileges. This happens regardless of the real user ID of the process that executes the file. As an example, the UNIX System program that allows anyone to change his or her password, passwd(1), is a set-user-ID program. This is required so that the program can write the new password to the password file, typically either /etc/passwd or /etc/shadow, files that should be writable only by the superuser. Because a process that is running set-user-ID to some other user usually assumes extra permissions, it must be written carefully. We'll discuss these types of programs in more detail in Chapter 8.

Returning to the stat function, the set-user-ID bit and the set-group-ID bit are contained in the file's st_mode value. These two bits can be tested against the constants S_ISUID and S_ISGID.

4.5. File Access Permissions

The st_mode value also encodes the access permission bits for the file. When we say file, we mean any of the file types that we described earlier. All the file types (directories, character special files, and so on) have permissions. Many people think only of regular files as having access permissions.

There are nine permission bits for each file, divided into three categories. These are shown in Figure 4.6.

Figure 4.6. The nine file access permission bits, from <sys/stat.h>
     st_mode mask    Meaning
     S_IRUSR         user-read
     S_IWUSR         user-write
     S_IXUSR         user-execute
     S_IRGRP         group-read
     S_IWGRP         group-write
     S_IXGRP         group-execute
     S_IROTH         other-read
     S_IWOTH         other-write
     S_IXOTH         other-execute

The term user in the first three rows in Figure 4.6 refers to the owner of the file. The chmod(1) command, which is typically used to modify these nine permission bits, allows us to specify u for user (owner), g for group, and o for other. Some books refer to these three as owner, group, and world; this is confusing, as the chmod command uses o to mean other, not owner. We'll use the terms user, group, and other, to be consistent with the chmod command.

The three categories in Figure 4.6 (read, write, and execute) are used in various ways by different functions. We'll summarize them here, and return to them when we describe the actual functions.

The first rule is that whenever we want to open any type of file by name, we must have execute permission in each directory mentioned in the name, including the current directory, if it is implied. This is why the execute permission bit for a directory is often called the search bit.
For example, to open the file /usr/include/stdio.h, we need execute permission in the directory /, execute permission in the directory /usr, and execute permission in the directory /usr/include. We then need appropriate permission for the file itself, depending on how we're trying to open it: read-only, read-write, and so on.
If the current directory is /usr/include, then we need execute permission in the current directory to open the file stdio.h. This is an example of the current directory being implied, not specifically mentioned. It is identical to our opening the file ./stdio.h.
Note that read permission for a directory and execute permission for a directory mean different things. Read permission lets us read the directory, obtaining a list of all the filenames in the directory. Execute permission lets us pass through the directory when it is a component of a pathname that we are trying to access. (We need to search the directory to look for a specific filename.)
Another example of an implicit directory reference is if the PATH environment variable, described in Section 8.10, specifies a directory that does not have execute permission enabled. In this case, the shell will never find executable files in that directory.
The read permission for a file determines whether we can open an existing file for reading: the O_RDONLY and O_RDWR flags for the open function.
The write permission for a file determines whether we can open an existing file for writing: the O_WRONLY and O_RDWR flags for the open function.
We must have write permission for a file to specify the O_TRUNC flag in the open function.
We cannot create a new file in a directory unless we have write permission and execute permission in the directory.
To delete an existing file, we need write permission and execute permission in the directory containing the file. We do not need read permission or write permission for the file itself.
Execute permission for a file must be on if we want to execute the file using any of the six exec functions (Section 8.10). The file also has to be a regular file.

The file access tests that the kernel performs each time a process opens, creates, or deletes a file depend on the owners of the file (st_uid and st_gid), the effective IDs of the process (effective user ID and effective group ID), and the supplementary group IDs of the process, if supported. The two owner IDs are properties of the file, whereas the two effective IDs and the supplementary group IDs are properties of the process. The tests performed by the kernel are as follows.

1. If the effective user ID of the process is 0 (the superuser), access is allowed. This gives the superuser free rein throughout the entire file system.
2. If the effective user ID of the process equals the owner ID of the file (i.e., the process owns the file), access is allowed if the appropriate user access permission bit is set. Otherwise, permission is denied. By appropriate access permission bit, we mean that if the process is opening the file for reading, the user-read bit must be on. If the process is opening the file for writing, the user-write bit must be on. If the process is executing the file, the user-execute bit must be on.
3. If the effective group ID of the process or one of the supplementary group IDs of the process equals the group ID of the file, access is allowed if the appropriate group access permission bit is set. Otherwise, permission is denied.
4. If the appropriate other access permission bit is set, access is allowed. Otherwise, permission is denied.

These four steps are tried in sequence. Note that if the process owns the file (step 2), access is granted or denied based only on the user access permissions; the group permissions are never looked at. Similarly, if the process does not own the file, but belongs to an appropriate group, access is granted or denied based only on the group access permissions; the other permissions are not looked at.

4.6. Ownership of New Files and Directories

When we described the creation of a new file in Chapter 3, using either open or creat, we never said what values were assigned to the user ID and group ID of the new file. We'll see how to create a new directory in Section 4.20 when we describe the mkdir function. The rules for the ownership of a new directory are identical to the rules in this section for the ownership of a new file.

The user ID of a new file is set to the effective user ID of the process. POSIX.1 allows an implementation to choose one of the following options to determine the group ID of a new file.

The group ID of a new file can be the effective group ID of the process.
The group ID of a new file can be the group ID of the directory in which the file is being created.

FreeBSD 5.2.1 and Mac OS X 10.3 always use the group ID of the directory as the group ID of the new file.

The Linux ext2 and ext3 file systems allow the choice between these two POSIX.1 options to be made on a file system basis, using a special flag to the mount(1) command. On Linux 2.4.22 (with the proper mount option) and Solaris 9, the group ID of a new file depends on whether the set-group-ID bit is set for the directory in which the file is being created. If this bit is set for the directory, the group ID of the new file is set to the group ID of the directory; otherwise, the group ID of the new file is set to the effective group ID of the process.

Using the second option (inheriting the group ID of the directory) assures us that all files and directories created in that directory will have the group ID belonging to the directory. This group ownership of files and directories will then propagate down the hierarchy from that point. This is used, for example, in the /var/spool/mail directory on Linux.

As we mentioned, this option for group ownership is the default for FreeBSD 5.2.1 and Mac OS X 10.3, but an option for Linux and Solaris. Under Linux 2.4.22 and Solaris 9, we have to enable the set-group-ID bit, and the mkdir function has to propagate a directory's set-group-ID bit automatically for this to work. (This is described in Section 4.20.)

4.7. `access` Function

As we described earlier, when we open a file, the kernel performs its access tests based on the effective user and group IDs. There are times when a process wants to test accessibility based on the real user and group IDs. This is useful when a process is running as someone else, using either the set-user-ID or the set-group-ID feature. Even though a process might be set-user-ID to root, it could still want to verify that the real user can access a given file. The access function bases its tests on the real user and group IDs. (Replace effective with real in the four steps at the end of Section 4.5.)

     #include <unistd.h>

     int access(const char *pathname, int mode);

Returns: 0 if OK, -1 on error

The mode is the bitwise OR of any of the constants shown in Figure 4.7.

Figure 4.7. The mode constants for access function, from <unistd.h>
     mode    Description
     R_OK    test for read permission
     W_OK    test for write permission
     X_OK    test for execute permission
     F_OK    test for existence of file

Example

Figure 4.8 shows the use of the access function.

Here is a sample session with this program:

          $ ls -l a.out
          -rwxrwxr-x 1 sar         15945 Nov 30 12:10 a.out
          $ ./a.out a.out
          read access OK
          open for reading OK
          $ ls -l /etc/shadow
          -r-------- 1 root         1315 Jul 17 2002 /etc/shadow
          $ ./a.out /etc/shadow
          access error for /etc/shadow: Permission denied
          open error for /etc/shadow: Permission denied
          $ su                        become superuser
          Password:                  enter superuser password
          # chown root a.out         change file's user ID to root
          # chmod u+s a.out          and turn on set-user-ID bit
          # ls -l a.out              check owner and SUID bit
          -rwsrwxr-x 1 root     15945 Nov 30 12:10 a.out
          # exit                     go back to normal user
          $ ./a.out /etc/shadow
          access error for /etc/shadow: Permission denied
          open for reading OK

In this example, the set-user-ID program can determine that the real user cannot normally read the file, even though the open function will succeed.

Figure 4.8. Example of `access` function

 #include "apue.h"
 #include <fcntl.h>
 
 int
 main(int argc, char *argv[])
 {
     if (argc != 2)
         err_quit("usage: a.out <pathname>");
     if (access(argv[1], R_OK) < 0)
         err_ret("access error for %s", argv[1]);
     else
         printf("read access OK\n");
     if (open(argv[1], O_RDONLY) < 0)
         err_ret("open error for %s", argv[1]);
     else
         printf("open for reading OK\n");
     exit(0);
 }

In the preceding example and in Chapter 8, we'll sometimes switch to become the superuser, to demonstrate how something works. If you're on a multiuser system and do not have superuser permission, you won't be able to duplicate these examples completely.

4.8. `umask` Function

Now that we've described the nine permission bits associated with every file, we can describe the file mode creation mask that is associated with every process.

The umask function sets the file mode creation mask for the process and returns the previous value. (This is one of the few functions that doesn't have an error return.)

     #include <sys/stat.h>

     mode_t umask(mode_t cmask);

Returns: previous file mode creation mask

The cmask argument is formed as the bitwise OR of any of the nine constants from Figure 4.6: S_IRUSR, S_IWUSR, and so on.

The file mode creation mask is used whenever the process creates a new file or a new directory. (Recall from Sections 3.3 and 3.4 our description of the open and creat functions. Both accept a mode argument that specifies the new file's access permission bits.) We describe how to create a new directory in Section 4.20. Any bits that are on in the file mode creation mask are turned off in the file's mode.

Example

The program in Figure 4.9 creates two files, one with a umask of 0 and one with a umask that disables all the group and other permission bits.

If we run this program, we can see how the permission bits have been set.

        $ umask                    first print the current file mode creation mask
        002
        $ ./a.out
        $ ls -l foo bar
        -rw------- 1 sar            0 Dec 7 21:20 bar
        -rw-rw-rw- 1 sar            0 Dec 7 21:20 foo
        $ umask                    see if the file mode creation mask changed
        002

Figure 4.9. Example of `umask` function

 #include "apue.h"
 #include <fcntl.h>
 
 #define RWRWRW (S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP|S_IROTH|S_IWOTH)
 
 int
 main(void)
 {
     umask(0);
     if (creat("foo", RWRWRW) < 0)
         err_sys("creat error for foo");
     umask(S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH);
     if (creat("bar", RWRWRW) < 0)
         err_sys("creat error for bar");
     exit(0);
 }

Most users of UNIX systems never deal with their umask value. It is usually set once, on login, by the shell's start-up file, and never changed. Nevertheless, when writing programs that create new files, if we want to ensure that specific access permission bits are enabled, we must modify the umask value while the process is running. For example, if we want to ensure that anyone can read a file, we should set the umask to 0. Otherwise, the umask value that is in effect when our process is running can cause permission bits to be turned off.

In the preceding example, we use the shell's umask command to print the file mode creation mask before we run the program and after. This shows us that changing the file mode creation mask of a process doesn't affect the mask of its parent (often a shell). All of the shells have a built-in umask command that we can use to set or print the current file mode creation mask.

Users can set the umask value to control the default permissions on the files they create. The value is expressed in octal, with one bit representing one permission to be masked off, as shown in Figure 4.10. Permissions can be denied by setting the corresponding bits. Some common umask values are 002 to prevent others from writing your files, 022 to prevent group members and others from writing your files, and 027 to prevent group members from writing your files and others from reading, writing, or executing your files.

Figure 4.10. The umask file access permission bits
     Mask bit    Meaning
     0400        user-read
     0200        user-write
     0100        user-execute
     0040        group-read
     0020        group-write
     0010        group-execute
     0004        other-read
     0002        other-write
     0001        other-execute

The Single UNIX Specification requires that the shell support a symbolic form of the umask command. Unlike the octal format, the symbolic format specifies which permissions are to be allowed (i.e., clear in the file creation mask) instead of which ones are to be denied (i.e., set in the file creation mask). Compare both forms of the command, shown below.

        $ umask                        first print the current file mode creation mask
        002
        $ umask -S                     print the symbolic form
        u=rwx,g=rwx,o=rx
        $ umask 027                    change the file mode creation mask
        $ umask -S                     print the symbolic form
        u=rwx,g=rx,o=

4.9. `chmod` and `fchmod` Functions

These two functions allow us to change the file access permissions for an existing file.

     #include <sys/stat.h>

     int chmod(const char *pathname, mode_t mode);

     int fchmod(int filedes, mode_t mode);

Both return: 0 if OK, -1 on error

The chmod function operates on the specified file, whereas the fchmod function operates on a file that has already been opened.

To change the permission bits of a file, the effective user ID of the process must be equal to the owner ID of the file, or the process must have superuser permissions.

The mode is specified as the bitwise OR of the constants shown in Figure 4.11.

Figure 4.11. The mode constants for chmod functions, from <sys/stat.h>
     mode       Description
     S_ISUID    set-user-ID on execution
     S_ISGID    set-group-ID on execution
     S_ISVTX    saved-text (sticky bit)
     S_IRWXU    read, write, and execute by user (owner)
     S_IRUSR    read by user (owner)
     S_IWUSR    write by user (owner)
     S_IXUSR    execute by user (owner)
     S_IRWXG    read, write, and execute by group
     S_IRGRP    read by group
     S_IWGRP    write by group
     S_IXGRP    execute by group
     S_IRWXO    read, write, and execute by other (world)
     S_IROTH    read by other (world)
     S_IWOTH    write by other (world)
     S_IXOTH    execute by other (world)

Note that nine of the entries in Figure 4.11 are the nine file access permission bits from Figure 4.6. We've added the two set-ID constants (S_ISUID and S_ISGID), the saved-text constant (S_ISVTX), and the three combined constants (S_IRWXU, S_IRWXG, and S_IRWXO).

The saved-text bit (S_ISVTX) is not part of POSIX.1. It is defined as an XSI extension in the Single UNIX Specification. We describe its purpose in the next section.

Example

Recall the final state of the files foo and bar when we ran the program in Figure 4.9 to demonstrate the umask function:

        $ ls -l foo bar
        -rw------- 1 sar                0 Dec 7 21:20 bar
        -rw-rw-rw- 1 sar                0 Dec 7 21:20 foo

The program shown in Figure 4.12 modifies the mode of these two files.

After running the program in Figure 4.12, we see that the final state of the two files is

      $ ls -l foo bar
      -rw-r--r-- 1 sar           0 Dec 7 21:20 bar
      -rw-rwSrw- 1 sar           0 Dec 7 21:20 foo

In this example, we have set the permissions of the file bar to an absolute value, regardless of the current permission bits. For the file foo, we set the permissions relative to their current state. To do this, we first call stat to obtain the current permissions and then modify them. We have explicitly turned on the set-group-ID bit and turned off the group-execute bit. Note that the ls command lists the group-execute permission as S to signify that the set-group-ID bit is set without the group-execute bit being set.

On Solaris, the ls command displays an l instead of an S to indicate that mandatory file and record locking has been enabled for this file. This applies only to regular files, but we'll discuss this more in Section 14.3.

Finally, note that the time and date listed by the ls command did not change after we ran the program in Figure 4.12. We'll see in Section 4.18 that the chmod function updates only the time that the i-node was last changed. By default, ls -l lists the time when the contents of the file were last modified.

Figure 4.12. Example of `chmod` function

 #include "apue.h"
 
 int
 main(void)
 {
      struct stat      statbuf;
 
      /* turn on set-group-ID and turn off group-execute */
 
      if (stat("foo", &statbuf) < 0)
          err_sys("stat error for foo");
      if (chmod("foo", (statbuf.st_mode & ~S_IXGRP) | S_ISGID) < 0)
          err_sys("chmod error for foo");
 
      /* set absolute mode to "rw-r--r--" */
 
      if (chmod("bar", S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) < 0)
          err_sys("chmod error for bar");
 
      exit(0);
 }

The chmod functions automatically clear two of the permission bits under the following conditions:

On systems, such as Solaris, that place special meaning on the sticky bit when used with regular files, if we try to set the sticky bit (S_ISVTX) on a regular file and do not have superuser privileges, the sticky bit in the mode is automatically turned off. (We describe the sticky bit in the next section.) This means that only the superuser can set the sticky bit of a regular file. The reason is to prevent malicious users from setting the sticky bit and adversely affecting system performance.

On FreeBSD 5.2.1, Mac OS X 10.3, and Solaris 9, only the superuser can set the sticky bit on a regular file. Linux 2.4.22 places no such restriction on the setting of the sticky bit, because the bit has no meaning when applied to regular files on Linux. Although the bit also has no meaning when applied to regular files on FreeBSD and Mac OS X, these systems prevent everyone but the superuser from setting it on a regular file.
It is possible that the group ID of a newly created file is a group that the calling process does not belong to. Recall from Section 4.6 that it's possible for the group ID of the new file to be the group ID of the parent directory. Specifically, if the group ID of the new file does not equal either the effective group ID of the process or one of the process's supplementary group IDs and if the process does not have superuser privileges, then the set-group-ID bit is automatically turned off. This prevents a user from creating a set-group-ID file owned by a group that the user doesn't belong to.

FreeBSD 5.2.1, Linux 2.4.22, Mac OS X 10.3, and Solaris 9 add another security feature to try to prevent misuse of some of the protection bits. If a process that does not have superuser privileges writes to a file, the set-user-ID and set-group-ID bits are automatically turned off. If malicious users find a set-group-ID or a set-user-ID file they can write to, even though they can modify the file, they lose the special privileges of the file.

4.10. Sticky Bit

The S_ISVTX bit has an interesting history. On versions of the UNIX System that predated demand paging, this bit was known as the sticky bit. If it was set for an executable program file, then the first time the program was executed, a copy of the program's text was saved in the swap area when the process terminated. (The text portion of a program is the machine instructions.) This caused the program to load into memory more quickly the next time it was executed, because the swap area was handled as a contiguous file, compared to the possibly random location of data blocks in a normal UNIX file system. The sticky bit was often set for common application programs, such as the text editor and the passes of the C compiler. Naturally, there was a limit to the number of sticky files that could be contained in the swap area before running out of swap space, but it was a useful technique. The name sticky came about because the text portion of the file stuck around in the swap area until the system was rebooted. Later versions of the UNIX System referred to this as the saved-text bit; hence, the constant S_ISVTX. With today's newer UNIX systems, most of which have a virtual memory system and a faster file system, the need for this technique has disappeared.

On contemporary systems, the use of the sticky bit has been extended. The Single UNIX Specification allows the sticky bit to be set for a directory. If the bit is set for a directory, a file in the directory can be removed or renamed only if the user has write permission for the directory and one of the following:

Owns the file
Owns the directory
Is the superuser

The directories /tmp and /var/spool/uucppublic are typical candidates for the sticky bit: they are directories in which any user can typically create files. The permissions for these two directories are often read, write, and execute for everyone (user, group, and other). But users should not be able to delete or rename files owned by others.

The saved-text bit is not part of POSIX.1. It is an XSI extension to the basic POSIX.1 functionality defined in the Single UNIX Specification, and is supported by FreeBSD 5.2.1, Linux 2.4.22, Mac OS X 10.3, and Solaris 9.

Solaris 9 places special meaning on the sticky bit if it is set on a regular file. In this case, if none of the execute bits is set, the operating system will not cache the contents of the file.

4.11. `chown`, `fchown`, and `lchown` Functions

The chown functions allow us to change the user ID of a file and the group ID of a file.

     #include <unistd.h>

     int chown(const char *pathname, uid_t owner, gid_t group);

     int fchown(int filedes, uid_t owner, gid_t group);

     int lchown(const char *pathname, uid_t owner, gid_t group);

All three return: 0 if OK, -1 on error

These three functions operate similarly unless the referenced file is a symbolic link. In that case, lchown changes the owners of the symbolic link itself, not the file pointed to by the symbolic link.

The lchown function is an XSI extension to the POSIX.1 functionality defined in the Single UNIX Specification. As such, all UNIX System implementations are expected to provide it.

If either of the arguments owner or group is -1, the corresponding ID is left unchanged.

Historically, BSD-based systems have enforced the restriction that only the superuser can change the ownership of a file. This is to prevent users from giving away their files to others, thereby defeating any disk space quota restrictions. System V, however, has allowed any user to change the ownership of any files they own.

POSIX.1 allows either form of operation, depending on the value of _POSIX_CHOWN_RESTRICTED.

With Solaris 9, this functionality is a configuration option, whose default value is to enforce the restriction. FreeBSD 5.2.1, Linux 2.4.22, and Mac OS X 10.3 always enforce the chown restriction.

Recall from Section 2.6 that the _POSIX_CHOWN_RESTRICTED constant can optionally be defined in the header <unistd.h>, and can always be queried using either the pathconf function or the fpathconf function. Also recall that this option can depend on the referenced file; it can be enabled or disabled on a per file system basis. We'll use the phrase, if _POSIX_CHOWN_RESTRICTED is in effect, to mean if it applies to the particular file that we're talking about, regardless of whether this actual constant is defined in the header.

If _POSIX_CHOWN_RESTRICTED is in effect for the specified file, then

Only a superuser process can change the user ID of the file.
A nonsuperuser process can change the group ID of the file if the process owns the file (the effective user ID equals the user ID of the file), owner is specified as -1 or equals the user ID of the file, and group equals either the effective group ID of the process or one of the process's supplementary group IDs.

This means that when _POSIX_CHOWN_RESTRICTED is in effect, you can't change the user ID of other users' files. You can change the group ID of files that you own, but only to groups that you belong to.

If these functions are called by a process other than a superuser process, on successful return, both the set-user-ID and the set-group-ID bits are cleared.
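The rules above can be tried out with a minimal sketch. The helper names change_group_only, chown_is_restricted, and print_ids are hypothetical and are not part of any library. The sketch also assumes the usual POSIX convention that pathconf returns -1 without changing errno when an option is not in effect, and any other value when it is.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: change only the group ID of a file.
 * Passing -1 as owner leaves the user ID unchanged. */
int change_group_only(const char *path, gid_t group)
{
    return chown(path, (uid_t)-1, group);
}

/* Hypothetical helper: returns 1 if _POSIX_CHOWN_RESTRICTED appears to be
 * in effect for path, 0 if it is not, and -1 if pathconf itself failed.
 * Assumption: -1 with errno untouched means "not in effect". */
int chown_is_restricted(const char *path)
{
    long v;

    errno = 0;
    v = pathconf(path, _PC_CHOWN_RESTRICTED);
    if (v == -1)
        return (errno == 0) ? 0 : -1;
    return 1;
}

/* Hypothetical helper: print the user ID and group ID stored in the i-node. */
int print_ids(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) < 0)
        return -1;
    printf("%s: uid=%ld gid=%ld\n", path, (long)sb.st_uid, (long)sb.st_gid);
    return 0;
}
```

A nonsuperuser process that owns the file can normally call change_group_only(path, getegid()) successfully, since the new group is the effective group ID of the process.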

4.12. File Size

The st_size member of the stat structure contains the size of the file in bytes. This field is meaningful only for regular files, directories, and symbolic links.

Solaris also defines the file size for a pipe as the number of bytes that are available for reading from the pipe. We'll discuss pipes in Section 15.2.

For a regular file, a file size of 0 is allowed. We'll get an end-of-file indication on the first read of the file.

For a directory, the file size is usually a multiple of a number, such as 16 or 512. We talk about reading directories in Section 4.21.

For a symbolic link, the file size is the number of bytes in the filename. For example, in the following case, the file size of 7 is the length of the pathname usr/lib:

     lrwxrwxrwx 1 root           7 Sep 25 07:14 lib -> usr/lib

(Note that symbolic links do not contain the normal C null byte at the end of the name, as the length is always specified by st_size.)

Most contemporary UNIX systems provide the fields st_blksize and st_blocks. The first is the preferred block size for I/O for the file, and the latter is the actual number of 512-byte blocks that are allocated. Recall from Section 3.9 that we encountered the minimum amount of time required to read a file when we used st_blksize for the read operations. The standard I/O library, which we describe in Chapter 5, also tries to read or write st_blksize bytes at a time, for efficiency.

Be aware that different versions of the UNIX System use units other than 512-byte blocks for st_blocks. Using this value is nonportable.

Holes in a File

In Section 3.6, we mentioned that a regular file can contain "holes." We showed an example of this in Figure 3.2. Holes are created by seeking past the current end of file and writing some data. As an example, consider the following:

      $ ls -l core
      -rw-r--r-- 1 sar       8483248 Nov 18 12:18 core
      $ du -s core
      272        core

The size of the file core is just over 8 MB, yet the du command reports that the amount of disk space used by the file is 272 512-byte blocks (139,264 bytes). (The du command on many BSD-derived systems reports the number of 1,024-byte blocks; Solaris reports the number of 512-byte blocks.) Obviously, this file has many holes.

As we mentioned in Section 3.6, the read function returns data bytes of 0 for any byte positions that have not been written. If we execute the following, we can see that the normal I/O operations read up through the size of the file:

      $ wc -c core
       8483248 core

The wc(1) command with the -c option counts the number of characters (bytes) in the file.

If we make a copy of this file, using a utility such as cat(1), all these holes are written out as actual data bytes of 0:

        $ cat core > core.copy
        $ ls -l core*
        -rw-r--r--  1 sar      8483248 Nov 18 12:18 core
        -rw-rw-r--  1 sar      8483248 Nov 18 12:27 core.copy
        $ du -s core*
        272     core
        16592   core.copy

Here, the actual number of bytes used by the new file is 8,495,104 (512 x 16,592). The difference between this size and the size reported by ls is caused by the number of blocks used by the file system to hold pointers to the actual data blocks.
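As a rough illustration of how a hole comes into being, the following sketch seeks past the end of a new file and writes a single byte, then prints the sizes that stat reports. The function names make_sparse_file and print_sizes are hypothetical. Whether the file system really leaves the hole unallocated, and the units of st_blocks, are system dependent, so the sketch only prints those values.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path containing a hole of hole_bytes
 * followed by one data byte. Returns 0 if OK, -1 on error. */
int make_sparse_file(const char *path, off_t hole_bytes)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    /* Seek past the current end of file, then write: this creates the hole. */
    if (lseek(fd, hole_bytes, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Hypothetical helper: print the logical size next to the block counts. */
int print_sizes(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) < 0)
        return -1;
    printf("%s: st_size=%lld st_blksize=%ld st_blocks=%lld\n", path,
           (long long)sb.st_size, (long)sb.st_blksize,
           (long long)sb.st_blocks);
    return 0;
}
```

On a file system that supports holes, st_blocks for such a file is typically far smaller than st_size would suggest, which is the same effect that du shows for the core file above.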

Interested readers should refer to Section 4.2 of Bach [1986], Sections 7.2 and 7.3 of McKusick et al. [1996] (or Sections 8.2 and 8.3 in McKusick and Neville-Neil [2005]), and Section 14.2 of Mauro and McDougall [2001] for additional details on the physical layout of files.

4.13. File Truncation

There are times when we would like to truncate a file by chopping off data at the end of the file. Emptying a file, which we can do with the O_TRUNC flag to open, is a special case of truncation.

     #include <unistd.h>

     int truncate(const char *pathname, off_t length);
     int ftruncate(int filedes, off_t length);

Both return: 0 if OK, -1 on error

These two functions truncate an existing file to length bytes. If the previous size of the file was greater than length, the data beyond length is no longer accessible. If the previous size was less than length, the effect is system dependent, but XSI-conforming systems will increase the file size. If the implementation does extend a file, data between the old end of file and the new end of file will read as 0 (i.e., a hole is probably created in the file).

The ftruncate function is part of POSIX.1. The truncate function is an XSI extension to the POSIX.1 functionality defined in the Single UNIX Specification.

BSD releases prior to 4.4BSD could only make a file smaller with truncate.

Solaris also includes an extension to fcntl (F_FREESP) that allows us to free any part of a file, not just a chunk at the end of the file.

We use ftruncate in the program shown in Figure 13.6 when we need to empty a file after obtaining a lock on the file.
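To make the shrinking and extending cases concrete, here is a small sketch with a hypothetical function, truncate_demo. It assumes XSI behavior, in which ftruncate can make a file larger; on a system that does not extend files, the final read comes up short and the function reports an error.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write ten bytes, shrink the file to 4 bytes,
 * extend it to 8 bytes, and copy the resulting contents into out.
 * Returns 0 if OK, -1 on error. */
int truncate_demo(const char *path, char out[8])
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    int ok;

    if (fd < 0)
        return -1;
    ok = write(fd, "0123456789", 10) == 10
      && ftruncate(fd, 4) == 0           /* bytes 4 through 9 are discarded */
      && ftruncate(fd, 8) == 0           /* assumed XSI: file grows to 8 bytes */
      && lseek(fd, 0, SEEK_SET) == 0
      && read(fd, out, 8) == 8;          /* new bytes 4 through 7 read as 0 */
    close(fd);
    return ok ? 0 : -1;
}
```

After a successful call, out holds the characters 0123 followed by four bytes of 0, matching the description of data between the old and new end of file.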

4.14. File Systems

To appreciate the concept of links to a file, we need a conceptual understanding of the structure of the UNIX file system. Understanding the difference between an i-node and a directory entry that points to an i-node is also useful.

Various implementations of the UNIX file system are in use today. Solaris, for example, supports several different types of disk file systems: the traditional BSD-derived UNIX file system (called UFS), a file system (called PCFS) to read and write DOS-formatted diskettes, and a file system (called HSFS) to read CD file systems. We saw one difference between file system types in Figure 2.19. UFS is based on the Berkeley fast file system, which we describe in this section.

We can think of a disk drive being divided into one or more partitions. Each partition can contain a file system, as shown in Figure 4.13.

Figure 4.13. Disk drive, partitions, and a file system


The i-nodes are fixed-length entries that contain most of the information about a file.

If we examine the i-node and data block portion of a cylinder group in more detail, we could have what is shown in Figure 4.14.

Figure 4.14. Cylinder group's i-nodes and data blocks in more detail


Note the following points from Figure 4.14.

We show two directory entries that point to the same i-node entry. Every i-node has a link count that contains the number of directory entries that point to the i-node. Only when the link count goes to 0 can the file be deleted (i.e., can the data blocks associated with the file be released). This is why the operation of "unlinking a file" does not always mean "deleting the blocks associated with the file." This is why the function that removes a directory entry is called unlink, not delete. In the stat structure, the link count is contained in the st_nlink member. Its primitive system data type is nlink_t. These types of links are called hard links. Recall from Section 2.5.2 that the POSIX.1 constant LINK_MAX specifies the maximum value for a file's link count.
The other type of link is called a symbolic link. With a symbolic link, the actual contents of the file (the data blocks) store the name of the file that the symbolic link points to. In the following example, the filename in the directory entry is the three-character string lib and the 7 bytes of data in the file are usr/lib:
```
     lrwxrwxrwx 1 root         7 Sep 25 07:14 lib -> usr/lib
```
The file type in the i-node would be S_IFLNK so that the system knows that this is a symbolic link.
The i-node contains all the information about the file: the file type, the file's access permission bits, the size of the file, pointers to the file's data blocks, and so on. Most of the information in the stat structure is obtained from the i-node. Only two items of interest are stored in the directory entry: the filename and the i-node number. The other items (the length of the filename and the length of the directory record) are not of interest to this discussion. The data type for the i-node number is ino_t.
Because the i-node number in the directory entry points to an i-node in the same file system, we cannot have a directory entry point to an i-node in a different file system. This is why the ln(1) command (make a new directory entry that points to an existing file) can't cross file systems. We describe the link function in the next section.
When renaming a file without changing file systems, the actual contents of the file need not be moved; all that needs to be done is to add a new directory entry that points to the existing i-node, and then unlink the old directory entry. The link count will remain the same. For example, to rename the file /usr/lib/foo to /usr/foo, the contents of the file foo need not be moved if the directories /usr/lib and /usr are on the same file system. This is how the mv(1) command usually operates.

We've talked about the concept of a link count for a regular file, but what about the link count field for a directory? Assume that we make a new directory in the working directory, as in

    $ mkdir testdir

Figure 4.15 shows the result. Note that in this figure, we explicitly show the entries for dot and dot-dot.

Figure 4.15. Sample cylinder group after creating the directory `testdir`


The i-node whose number is 2549 has a type field of "directory" and a link count equal to 2. Any leaf directory (a directory that does not contain any other directories) always has a link count of 2. The value of 2 is from the directory entry that names the directory (testdir) and from the entry for dot in that directory. The i-node whose number is 1267 has a type field of "directory" and a link count that is greater than or equal to 3. The reason we know that the link count is greater than or equal to 3 is that minimally, it is pointed to from the directory entry that names it (which we don't show in Figure 4.15), from dot, and from dot-dot in the testdir directory. Note that every subdirectory in a parent directory causes the parent directory's link count to be increased by 1.

This format is similar to the classic format of the UNIX file system, which is described in detail in Chapter 4 of Bach [1986]. Refer to Chapter 7 of McKusick et al. [1996] or Chapter 8 of McKusick and Neville-Neil [2005] for additional information on the changes made with the Berkeley fast file system. See Chapter 14 of Mauro and McDougall [2001] for details on UFS, the Solaris version of the Berkeley fast file system.

4.15. `link`, `unlink`, `remove`, and `rename` Functions

As we saw in the previous section, any file can have multiple directory entries pointing to its i-node. The way we create a link to an existing file is with the link function.

     #include <unistd.h>

     int link(const char *existingpath, const char *newpath);

Returns: 0 if OK, -1 on error

This function creates a new directory entry, newpath, that references the existing file existingpath. If the newpath already exists, an error is returned. Only the last component of the newpath is created. The rest of the path must already exist.

The creation of the new directory entry and the increment of the link count must be an atomic operation. (Recall the discussion of atomic operations in Section 3.11.)

Most implementations require that both pathnames be on the same file system, although POSIX.1 allows an implementation to support linking across file systems. If an implementation supports the creation of hard links to directories, it is restricted to only the superuser. The reason is that doing this can cause loops in the file system, which most utilities that process the file system aren't capable of handling. (We show an example of a loop introduced by a symbolic link in Section 4.16.) Many file system implementations disallow hard links to directories for this reason.
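The connection between link and the link count in the i-node can be sketched as follows. The function add_hard_link is a hypothetical helper; it assumes both pathnames are on the same file system and that existingpath is not a directory.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a second directory entry for an existing
 * file, then confirm that both names refer to the same i-node.
 * Returns 0 if OK, -1 on error (for example, if newpath already exists). */
int add_hard_link(const char *existingpath, const char *newpath)
{
    struct stat a, b;

    if (link(existingpath, newpath) < 0)
        return -1;
    if (stat(existingpath, &a) < 0 || stat(newpath, &b) < 0)
        return -1;
    printf("i-node %llu now has link count %lu\n",
           (unsigned long long)a.st_ino, (unsigned long)a.st_nlink);
    /* Same device and same i-node number means the same file. */
    return (a.st_dev == b.st_dev && a.st_ino == b.st_ino) ? 0 : -1;
}
```

Calling the function a second time with the same newpath fails, because link returns an error when newpath already exists.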

To remove an existing directory entry, we call the unlink function.

     #include <unistd.h>

     int unlink(const char *pathname);

Returns: 0 if OK, -1 on error

This function removes the directory entry and decrements the link count of the file referenced by pathname. If there are other links to the file, the data in the file is still accessible through the other links. The file is not changed if an error occurs.

We've mentioned before that to unlink a file, we must have write permission and execute permission in the directory containing the directory entry, as it is the directory entry that we will be removing. Also, we mentioned in Section 4.10 that if the sticky bit is set in this directory we must have write permission for the directory and one of the following:

Own the file
Own the directory
Have superuser privileges

Only when the link count reaches 0 can the contents of the file be deleted. One other condition prevents the contents of a file from being deleted: as long as some process has the file open, its contents will not be deleted. When a file is closed, the kernel first checks the count of the number of processes that have the file open. If this count has reached 0, the kernel then checks the link count; if it is 0, the file's contents are deleted.

Example

The program shown in Figure 4.16 opens a file and then unlinks it. The program then goes to sleep for 15 seconds before terminating.

Running this program gives us

     $ ls -l tempfile            look at how big the file is
     -rw-r----- 1 sar     413265408 Jan 21 07:14 tempfile
     $ df /home                  check how much free space is available
     Filesystem  1K-blocks     Used  Available  Use%  Mounted  on
     /dev/hda4    11021440  1956332    9065108   18%  /home
     $ ./a.out &                 run the program in Figure 4.16 in the background
     1364                        the shell prints its process ID
     $ file unlinked             the file is unlinked
     ls -l tempfile              see if the filename is still there
     ls: tempfile: No such file or directory           the directory entry is gone
     $ df /home                  see if the space is available yet
     Filesystem  1K-blocks     Used  Available  Use%  Mounted  on
     /dev/hda4    11021440  1956332    9065108   18%  /home
     $ done                      the program is done, all open files are closed
     df /home                    now the disk space should be available
     Filesystem  1K-blocks     Used  Available  Use%  Mounted on
     /dev/hda4    11021440  1552352    9469088   15%  /home
                                 now the 394.1 MB of disk space are available

Figure 4.16. Open a file and then `unlink` it

 #include "apue.h"
 #include <fcntl.h>
 
 int
 main(void)
 {
     if (open("tempfile", O_RDWR) < 0)
         err_sys("open error");
     if (unlink("tempfile") < 0)
         err_sys("unlink error");
     printf("file unlinked\n");
     sleep(15);
     printf("done\n");
     exit(0);
 }

This property of unlink is often used by a program to ensure that a temporary file it creates won't be left around in case the program crashes. The process creates a file using either open or creat and then immediately calls unlink. The file is not deleted, however, because it is still open. Only when the process either closes the file or terminates, which causes the kernel to close all its open files, is the file deleted.
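The temporary-file idiom just described might look like the following sketch. The name open_anonymous_temp is hypothetical, and the caller is assumed to pick a pathname that does not yet exist.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create path, unlink it at once, and return a
 * descriptor that stays usable until it is closed. Returns -1 on error. */
int open_anonymous_temp(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;
    /* The directory entry goes away here, but the contents are kept
     * for as long as the file remains open. */
    if (unlink(path) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The descriptor can be read and written as usual. When the process closes it or terminates, the kernel releases the file's contents, so nothing is left behind even if the program crashes.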

If pathname is a symbolic link, unlink removes the symbolic link, not the file referenced by the link. There is no function to remove the file referenced by a symbolic link given the name of the link.

The superuser can call unlink with pathname specifying a directory, but the function rmdir should be used instead to unlink a directory. We describe the rmdir function in Section 4.20.

We can also unlink a file or a directory with the remove function. For a file, remove is identical to unlink. For a directory, remove is identical to rmdir.

     #include <stdio.h>

     int remove(const char *pathname);

Returns: 0 if OK, -1 on error

ISO C specifies the remove function to delete a file. The name was changed from the historical UNIX name of unlink because most non-UNIX systems that implement the C standard didn't support the concept of links to a file at the time.

A file or a directory is renamed with the rename function.

     #include <stdio.h>

     int rename(const char *oldname, const char *newname);

Returns: 0 if OK, -1 on error

This function is defined by ISO C for files. (The C standard doesn't deal with directories.) POSIX.1 expanded the definition to include directories and symbolic links.

There are several conditions to describe, depending on whether oldname refers to a file, a directory, or a symbolic link. We must also describe what happens if newname already exists.

If oldname specifies a file that is not a directory, then we are renaming a file or a symbolic link. In this case, if newname exists, it cannot refer to a directory. If newname exists and is not a directory, it is removed, and oldname is renamed to newname. We must have write permission for the directory containing oldname and for the directory containing newname, since we are changing both directories.
If oldname specifies a directory, then we are renaming a directory. If newname exists, it must refer to a directory, and that directory must be empty. (When we say that a directory is empty, we mean that the only entries in the directory are dot and dot-dot.) If newname exists and is an empty directory, it is removed, and oldname is renamed to newname. Additionally, when we're renaming a directory, newname cannot contain a path prefix that names oldname. For example, we can't rename /usr/foo to /usr/foo/testdir, since the old name (/usr/foo) is a path prefix of the new name and cannot be removed.
If either oldname or newname refers to a symbolic link, then the link itself is processed, not the file to which it resolves.
As a special case, if the oldname and newname refer to the same file, the function returns successfully without changing anything.

If newname already exists, we need permissions as if we were deleting it. Also, because we're removing the directory entry for oldname and possibly creating a directory entry for newname, we need write permission and execute permission in the directory containing oldname and in the directory containing newname.
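One common use of the rule that rename removes an existing newname is to replace a file's contents in a single step. The sketch below is only an illustration: replace_file is a hypothetical helper, and it assumes that tmpname and target are in the same directory (or at least on the same file system) and that target is not a directory.

```c
#include <stdio.h>

/* Hypothetical helper: write text to tmpname, then rename tmpname over
 * target. Returns 0 if OK, -1 on error. */
int replace_file(const char *target, const char *tmpname, const char *text)
{
    FILE *fp = fopen(tmpname, "w");

    if (fp == NULL)
        return -1;
    if (fputs(text, fp) == EOF) {
        fclose(fp);
        remove(tmpname);
        return -1;
    }
    if (fclose(fp) == EOF) {
        remove(tmpname);
        return -1;
    }
    /* If target exists and is not a directory, rename removes it and
     * gives its name to the new file. */
    if (rename(tmpname, target) < 0) {
        remove(tmpname);
        return -1;
    }
    return 0;
}
```

After a successful call, the name tmpname no longer exists and target refers to the newly written file.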

4.16. Symbolic Links

A symbolic link is an indirect pointer to a file, unlike the hard links from the previous section, which pointed directly to the i-node of the file. Symbolic links were introduced to get around the limitations of hard links.

Hard links normally require that the link and the file reside in the same file system
Only the superuser can create a hard link to a directory

There are no file system limitations on a symbolic link and what it points to, and anyone can create a symbolic link to a directory. Symbolic links are typically used to move a file or an entire directory hierarchy to another location on a system.

Symbolic links were introduced with 4.2BSD and subsequently supported by SVR4.

When using functions that refer to a file by name, we always need to know whether the function follows a symbolic link. If the function follows a symbolic link, a pathname argument to the function refers to the file pointed to by the symbolic link. Otherwise, a pathname argument refers to the link itself, not the file pointed to by the link. Figure 4.17 summarizes whether the functions described in this chapter follow a symbolic link. The functions mkdir, mkfifo, mknod, and rmdir are not in this figure, as they return an error when the pathname is a symbolic link. Also, the functions that take a file descriptor argument, such as fstat and fchmod, are not listed, as the handling of a symbolic link is done by the function that returns the file descriptor (usually open). Whether or not chown follows a symbolic link depends on the implementation.

In older versions of Linux (those before version 2.1.81), chown didn't follow symbolic links. From version 2.1.81 onward, chown follows symbolic links. With FreeBSD 5.2.1 and Mac OS X 10.3, chown follows symbolic links. (Prior to 4.4BSD, chown didn't follow symbolic links, but this was changed in 4.4BSD.) In Solaris 9, chown also follows symbolic links. All of these platforms provide implementations of lchown to change the ownership of symbolic links themselves.

One exception to Figure 4.17 is when the open function is called with both O_CREAT and O_EXCL set. In this case, if the pathname refers to a symbolic link, open will fail with errno set to EEXIST. This behavior is intended to close a security hole so that privileged processes can't be fooled into writing to the wrong files.
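The O_CREAT and O_EXCL exception can be seen with a short sketch. The function create_if_absent is a hypothetical helper, not a library call.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create path only if the name is completely unused.
 * Returns 1 if the file was created, 0 if the name was already taken
 * (by a file or by a symbolic link), and -1 on any other error. */
int create_if_absent(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd >= 0) {
        close(fd);
        return 1;
    }
    /* With O_CREAT | O_EXCL, a symbolic link at path is not followed;
     * open fails with EEXIST even if the link points to nothing. */
    return (errno == EEXIST) ? 0 : -1;
}
```

If path names a symbolic link to a nonexistent file, the function returns 0 and the file that the link points to is not created.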

Figure 4.17. Treatment of symbolic links by various functions
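| Function | Does not follow symbolic link | Follows symbolic link |
|----------|-------------------------------|-----------------------|
| access   |   | • |
| chdir    |   | • |
| chmod    |   | • |
| chown    | • | • |
| creat    |   | • |
| exec     |   | • |
| lchown   | • |   |
| link     |   | • |
| lstat    | • |   |
| open     |   | • |
| opendir  |   | • |
| pathconf |   | • |
| readlink | • |   |
| remove   | • |   |
| rename   | • |   |
| stat     |   | • |
| truncate |   | • |
| unlink   | • |   |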

Example

It is possible to introduce loops into the file system by using symbolic links. Most functions that look up a pathname return an errno of ELOOP when this occurs. Consider the following commands:

      $ mkdir foo                   make a new directory
      $ touch foo/a                 create a 0-length file
      $ ln -s ../foo foo/testdir    create a symbolic link
      $ ls -l foo
      total 0
      -rw-r----- 1 sar            0 Jan 22 00:16 a
      lrwxrwxrwx 1 sar            6 Jan 22 00:16 testdir -> ../foo

This creates a directory foo that contains the file a and a symbolic link that points to foo. We show this arrangement in Figure 4.18, drawing a directory as a circle and a file as a square. If we write a simple program that uses the standard function ftw(3) on Solaris to descend through a file hierarchy, printing each pathname encountered, the output is

     foo
     foo/a
     foo/testdir
     foo/testdir/a
     foo/testdir/testdir
     foo/testdir/testdir/a
     foo/testdir/testdir/testdir
     foo/testdir/testdir/testdir/a

(many more lines until we encounter an ELOOP error)

In Section 4.21, we provide our own version of the ftw function that uses lstat instead of stat, to prevent it from following symbolic links.

Note that on Linux, the ftw function uses lstat, so it doesn't display this behavior.

A loop of this form is easy to remove. We are able to unlink the file foo/testdir, as unlink does not follow a symbolic link. But if we create a hard link that forms a loop of this type, its removal is much more difficult. This is why the link function will not form a hard link to a directory unless the process has superuser privileges.

Indeed, Rich Stevens did this on his own system as an experiment while writing the original version of this section. The file system got corrupted and the normal fsck(1) utility couldn't fix things. The deprecated tools clri(8) and dcheck(8) were needed to repair the file system.

The need for hard links to directories has long since passed. With symbolic links and the mkdir function, there is no longer any need for users to create hard links to directories.

When we open a file, if the pathname passed to open specifies a symbolic link, open follows the link to the specified file. If the file pointed to by the symbolic link doesn't exist, open returns an error saying that it can't open the file. This can confuse users who aren't familiar with symbolic links. For example,

      $ ln -s /no/such/file myfile            create a symbolic link
      $ ls myfile
      myfile                                  ls says it's there
      $ cat myfile                            so we try to look at it
      cat: myfile: No such file or directory
      $ ls -l myfile                          try -l option
      lrwxrwxrwx 1 sar        13 Jan 22 00:26 myfile -> /no/such/file

The file myfile does exist, yet cat says there is no such file, because myfile is a symbolic link and the file pointed to by the symbolic link doesn't exist. The -l option to ls gives us two hints: the first character is an l, which means a symbolic link, and the sequence -> also indicates a symbolic link. The ls command has another option (-F) that appends an at-sign to filenames that are symbolic links, which can help spot symbolic links in a directory listing without the -l option.
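A program can detect this situation by comparing what lstat and stat report for the same pathname. The following is a minimal sketch; is_dangling_symlink is a hypothetical name.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path is a symbolic link whose target
 * cannot be reached, 0 if path is anything else, and -1 if path itself
 * does not exist or cannot be examined. */
int is_dangling_symlink(const char *path)
{
    struct stat sb;

    if (lstat(path, &sb) < 0)       /* lstat examines the link itself */
        return -1;
    if (!S_ISLNK(sb.st_mode))
        return 0;
    return stat(path, &sb) < 0;     /* stat follows the link */
}
```

For the myfile example above, lstat succeeds because the link exists, while stat fails because /no/such/file does not, so the function returns 1.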

Figure 4.18. Symbolic link `testdir` that creates a loop

4.17. `symlink` and `readlink` Functions

A symbolic link is created with the symlink function.

     #include <unistd.h>

     int symlink(const char *actualpath, const char *sympath);

Returns: 0 if OK, -1 on error

A new directory entry, sympath, is created that points to actualpath. It is not required that actualpath exist when the symbolic link is created. (We saw this in the example at the end of the previous section.) Also, actualpath and sympath need not reside in the same file system.

Because the open function follows a symbolic link, we need a way to open the link itself and read the name in the link. The readlink function does this.

     #include <unistd.h>

     ssize_t readlink(const char *restrict pathname, char *restrict buf,
                      size_t bufsize);

Returns: number of bytes read if OK, -1 on error

This function combines the actions of open, read, and close. If the function is successful, it returns the number of bytes placed into buf. The contents of the symbolic link that are returned in buf are not null terminated.
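Because the returned contents are not null terminated, callers usually add the terminator themselves. The wrapper below is one possible sketch; read_link_target is a hypothetical name. Note that a link whose contents do not fit in the buffer is silently truncated by this version.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the contents of the symbolic link path into
 * buf and null terminate them. Returns the number of bytes stored (not
 * counting the terminator), or -1 on error. */
ssize_t read_link_target(const char *path, char *buf, size_t bufsize)
{
    ssize_t n;

    if (bufsize == 0)
        return -1;
    n = readlink(path, buf, bufsize - 1);   /* leave room for the null byte */
    if (n < 0)
        return -1;
    buf[n] = '\0';                          /* readlink does not add this */
    return n;
}
```

For a link created with symlink("usr/lib", sympath), the function stores the string usr/lib and returns 7, the same value that lstat reports in st_size for the link.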

4.18. File Times

Three time fields are maintained for each file. Their purpose is summarized in Figure 4.19.

Figure 4.19. The three time values associated with each file
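| Field    | Description                         | Example      | ls(1) option |
|----------|-------------------------------------|--------------|--------------|
| st_atime | last-access time of file data       | read         | -u           |
| st_mtime | last-modification time of file data | write        | default      |
| st_ctime | last-change time of i-node status   | chmod, chown | -c           |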

Note the difference between the modification time (st_mtime) and the changed-status time (st_ctime). The modification time is when the contents of the file were last modified. The changed-status time is when the i-node of the file was last modified. In this chapter, we've described many operations that affect the i-node without changing the actual contents of the file: changing the file access permissions, changing the user ID, changing the number of links, and so on. Because all the information in the i-node is stored separately from the actual contents of the file, we need the changed-status time, in addition to the modification time.

Note that the system does not maintain the last-access time for an i-node. This is why the functions access and stat, for example, don't change any of the three times.

The access time is often used by system administrators to delete files that have not been accessed for a certain amount of time. The classic example is the removal of files named a.out or core that haven't been accessed in the past week. The find(1) command is often used for this type of operation.

The modification time and the changed-status time can be used to archive only those files that have had their contents modified or their i-node modified.

The ls command displays or sorts only on one of the three time values. By default, when invoked with either the -l or the -t option, it uses the modification time of a file. The -u option causes it to use the access time, and the -c option causes it to use the changed-status time.

Figure 4.20 summarizes the effects of the various functions that we've described on these three times. Recall from Section 4.14 that a directory is simply a file containing directory entries: filenames and associated i-node numbers. Adding, deleting, or modifying these directory entries can affect the three times associated with that directory. This is why Figure 4.20 contains one column for the three times associated with the file or directory and another column for the three times associated with the parent directory of the referenced file or directory. For example, creating a new file affects the directory that contains the new file, and it affects the i-node for the new file. Reading or writing a file, however, affects only the i-node of the file and has no effect on the directory. (The mkdir and rmdir functions are covered in Section 4.20. The utime function is covered in the next section. The six exec functions are described in Section 8.10. We describe the mkfifo and pipe functions in Chapter 15.)

Figure 4.20. Effect of various functions on the access, modification, and changed-status times
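In the table, a, m, and c mark the access, modification, and changed-status times. "File" is the referenced file or directory; "Parent" is the parent directory of the referenced file or directory.

| Function            | File a | File m | File c | Parent a | Parent m | Parent c | Section | Note                      |
|---------------------|--------|--------|--------|----------|----------|----------|---------|---------------------------|
| chmod, fchmod       |   |   | • |   |   |   | 4.9  |                           |
| chown, fchown       |   |   | • |   |   |   | 4.11 |                           |
| creat               | • | • | • |   | • | • | 3.4  | O_CREAT new file          |
| creat               |   | • | • |   |   |   | 3.4  | O_TRUNC existing file     |
| exec                | • |   |   |   |   |   | 8.10 |                           |
| lchown              |   |   | • |   |   |   | 4.11 |                           |
| link                |   |   | • |   | • | • | 4.15 | parent of second argument |
| mkdir               | • | • | • |   | • | • | 4.20 |                           |
| mkfifo              | • | • | • |   | • | • | 15.5 |                           |
| open                | • | • | • |   | • | • | 3.3  | O_CREAT new file          |
| open                |   | • | • |   |   |   | 3.3  | O_TRUNC existing file     |
| pipe                | • | • | • |   |   |   | 15.2 |                           |
| read                | • |   |   |   |   |   | 3.7  |                           |
| remove              |   |   | • |   | • | • | 4.15 | remove file = unlink      |
| remove              |   |   |   |   | • | • | 4.15 | remove directory = rmdir  |
| rename              |   |   | • |   | • | • | 4.15 | for both arguments        |
| rmdir               |   |   |   |   | • | • | 4.20 |                           |
| truncate, ftruncate |   | • | • |   |   |   | 4.13 |                           |
| unlink              |   |   | • |   | • | • | 4.15 |                           |
| utime               | • | • | • |   |   |   | 4.19 |                           |
| write               |   | • | • |   |   |   | 3.8  |                           |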

4.19. `utime` Function

The access time and the modification time of a file can be changed with the utime function.

     #include <utime.h>

     int utime(const char *pathname, const struct utimbuf *times);

Returns: 0 if OK, -1 on error

The structure used by this function is

     struct utimbuf {
       time_t actime;    /* access time */
       time_t modtime;   /* modification time */
     };

The two time values in the structure are calendar times, which count seconds since the Epoch, as described in Section 1.10.

The operation of this function, and the privileges required to execute it, depend on whether the times argument is NULL.

If times is a null pointer, the access time and the modification time are both set to the current time. To do this, either the effective user ID of the process must equal the owner ID of the file, or the process must have write permission for the file.
If times is a non-null pointer, the access time and the modification time are set to the values in the structure pointed to by times. For this case, the effective user ID of the process must equal the owner ID of the file, or the process must be a superuser process. Merely having write permission for the file is not adequate.

Note that we are unable to specify a value for the changed-status time, st_ctime (the time the i-node was last changed), as this field is automatically updated when the utime function is called.

On some versions of the UNIX System, the touch(1) command uses this function. Also, the standard archive programs, tar(1) and cpio(1), optionally call utime to set the times for a file to the time values saved when the file was archived.
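The two ways of calling utime can be sketched as follows. The helper names set_file_times and touch_now are hypothetical; the first requires that we own the file or have superuser privileges, while the second needs only ownership or write permission.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <utime.h>

/* Hypothetical helper: set the access time and the modification time of
 * a file to the same calendar time. Returns 0 if OK, -1 on error. */
int set_file_times(const char *path, time_t when)
{
    struct utimbuf tb;

    tb.actime  = when;      /* access time */
    tb.modtime = when;      /* modification time */
    return utime(path, &tb);
}

/* Hypothetical helper: set both times to the current time, much as
 * touch(1) does on some systems. Returns 0 if OK, -1 on error. */
int touch_now(const char *path)
{
    return utime(path, NULL);
}
```

In both cases the changed-status time is set by the system to the time of the call; there is no argument through which we could supply it.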

Example

The program shown in Figure 4.21 truncates files to zero length using the O_TRUNC option of the open function, but does not change their access time or modification time. To do this, the program first obtains the times with the stat function, truncates the file, and then resets the times with the utime function.

We can demonstrate the program in Figure 4.21 with the following script:

      $ ls -l changemod times           look at sizes and last-modification times
      -rwxrwxr-x 1 sar   15019   Nov  18  18:53  changemod
      -rwxrwxr-x 1 sar   16172   Nov  19  20:05  times
      $ ls -lu changemod times          look at last-access times
      -rwxrwxr-x 1 sar   15019   Nov  18  18:53  changemod
      -rwxrwxr-x 1 sar   16172   Nov  19  20:05  times
      $ date                            print today's date
      Thu Jan 22 06:55:17 EST 2004
      $ ./a.out changemod times         run the program in Figure 4.21
      $ ls -l changemod times           and check the results
      -rwxrwxr-x 1 sar        0  Nov  18  18:53  changemod
      -rwxrwxr-x 1 sar        0  Nov  19  20:05  times
      $ ls -lu changemod times          check the last-access times also
      -rwxrwxr-x 1 sar        0  Nov  18  18:53  changemod
      -rwxrwxr-x 1 sar        0  Nov  19  20:05  times
      $ ls -lc changemod times          and the changed-status times
      -rwxrwxr-x 1 sar        0  Jan  22  06:55  changemod
      -rwxrwxr-x 1 sar        0  Jan  22  06:55  times

As we expect, the last-modification times and the last-access times are not changed. The changed-status times, however, are changed to the time that the program was run.

Figure 4.21. Example of `utime` function

 #include "apue.h"
 #include <fcntl.h>
 #include <utime.h>
 
 int
 main(int argc, char *argv[])
 {
     int             i, fd;
     struct stat     statbuf;
     struct utimbuf  timebuf;
 
     for (i = 1; i < argc; i++) {
         if (stat(argv[i], &statbuf) < 0) { /* fetch current times */
             err_ret("%s: stat error", argv[i]);
             continue;
         }
         if ((fd = open(argv[i], O_RDWR | O_TRUNC)) < 0) { /* truncate */
             err_ret("%s: open error", argv[i]);
             continue;
         }
         close(fd);
         timebuf.actime  =  statbuf.st_atime;
         timebuf.modtime =  statbuf.st_mtime;
         if (utime(argv[i], &timebuf) < 0) {     /* reset times */
             err_ret("%s: utime error", argv[i]);
             continue;
         }
     }
     exit(0);
 }

4.20. `mkdir` and `rmdir` Functions

Directories are created with the mkdir function and deleted with the rmdir function.

#include <sys/stat.h>

int mkdir(const char *pathname, mode_t mode);

Returns: 0 if OK, -1 on error

This function creates a new, empty directory. The entries for dot and dot-dot are automatically created. The specified file access permissions, mode, are modified by the file mode creation mask of the process.

A common mistake is to specify the same mode as for a file: read and write permissions only. But for a directory, we normally want at least one of the execute bits enabled, to allow access to filenames within the directory. (See Exercise 4.16.)
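As a minimal sketch of the point about execute bits, the helper below asks for rwxr-xr-x rather than the rw-r--r-- we might pass when creating a regular file. The name make_dir and the particular mode are assumptions chosen for illustration; the process's file mode creation mask is still applied by the kernel.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * make_dir: hypothetical helper, not part of any library.
 * Creates a directory whose requested mode includes the execute
 * (search) bits, so that names inside it can be accessed.
 * Returns 0 if OK, -1 on error.
 */
int
make_dir(const char *pathname)
{
    /* rwxr-xr-x before the umask is applied */
    mode_t mode = S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH;

    /*
     * Passing S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH here instead would
     * create a directory that nobody could search.
     */
    return mkdir(pathname, mode);
}
```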

The user ID and group ID of the new directory are established according to the rules we described in Section 4.6.

Solaris 9 and Linux 2.4.22 also have the new directory inherit the set-group-ID bit from the parent directory. This is so that files created in the new directory will inherit the group ID of that directory. With Linux, the file system implementation determines whether this is supported. For example, the ext2 and ext3 file systems allow this behavior to be controlled by an option to the mount(1) command. With the Linux implementation of the UFS file system, however, the behavior is not selectable; it inherits the set-group-ID bit to mimic the historical BSD implementation, where the group ID of a directory is inherited from the parent directory.

BSD-based implementations don't propagate the set-group-ID bit; they simply inherit the group ID as a matter of policy. Because FreeBSD 5.2.1 and Mac OS X 10.3 are based on 4.4BSD, they do not require this inheriting of the set-group-ID bit. On these platforms, newly created files and directories always inherit the group ID of the parent directory, regardless of the set-group-ID bit.

Earlier versions of the UNIX System did not have the mkdir function. It was introduced with 4.2BSD and SVR3. In the earlier versions, a process had to call the mknod function to create a new directory. But use of the mknod function was restricted to superuser processes. To circumvent this, the normal command that created a directory, mkdir(1), had to be owned by root with the set-user-ID bit on. To create a directory from a process, the mkdir(1) command had to be invoked with the system(3) function.

An empty directory is deleted with the rmdir function. Recall that an empty directory is one that contains entries only for dot and dot-dot.

#include <unistd.h>

int rmdir(const char *pathname);

Returns: 0 if OK, -1 on error

If the link count of the directory becomes 0 with this call, and if no other process has the directory open, then the space occupied by the directory is freed. If one or more processes have the directory open when the link count reaches 0, the last link is removed and the dot and dot-dot entries are removed before this function returns. Additionally, no new files can be created in the directory. The directory is not freed, however, until the last process closes it. (Even though some other process has the directory open, it can't be doing much in the directory, as the directory had to be empty for the rmdir function to succeed.)

4.21. Reading Directories

Directories can be read by anyone who has access permission to read the directory. But only the kernel can write to a directory, to preserve file system sanity. Recall from Section 4.5 that the write permission bits and execute permission bits for a directory determine if we can create new files in the directory and remove files from the directory; they don't specify if we can write to the directory itself.

The actual format of a directory depends on the UNIX System implementation and the design of the file system. Earlier systems, such as Version 7, had a simple structure: each directory entry was 16 bytes, with 14 bytes for the filename and 2 bytes for the i-node number. When longer filenames were added to 4.2BSD, each entry became variable length, which means that any program that reads a directory is now system dependent. To simplify this, a set of directory routines were developed and are part of POSIX.1. Many implementations prevent applications from using the read function to access the contents of directories, thereby further isolating applications from the implementation-specific details of directory formats.

#include <dirent.h>

DIR *opendir(const char *pathname);

Returns: pointer if OK, NULL on error

struct dirent *readdir(DIR *dp);

Returns: pointer if OK, NULL at end of directory or error

void rewinddir(DIR *dp);

int closedir(DIR *dp);

Returns: 0 if OK, -1 on error

long telldir(DIR *dp);

Returns: current location in directory associated with dp

void seekdir(DIR *dp, long loc);

The telldir and seekdir functions are not part of the base POSIX.1 standard. They are XSI extensions in the Single UNIX Specification, so all conforming UNIX System implementations are expected to provide them.

Recall our use of several of these functions in the program shown in Figure 1.3, our bare-bones implementation of the ls command.

The dirent structure defined in the file <dirent.h> is implementation dependent. Implementations define the structure to contain at least the following two members:

       struct dirent {
         ino_t d_ino;                  /* i-node number */
         char  d_name[NAME_MAX + 1];   /* null-terminated filename */
       };

The d_ino entry is not defined by POSIX.1, since it's an implementation feature, but it is defined in the XSI extension to POSIX.1. POSIX.1 defines only the d_name entry in this structure.

Note that NAME_MAX is not a defined constant with Solaris; its value depends on the file system in which the directory resides, and its value is usually obtained from the fpathconf function. A common value for NAME_MAX is 255. (Recall Figure 2.14.) Since the filename is null terminated, however, it doesn't matter how the array d_name is defined in the header, because the array size doesn't indicate the length of the filename.
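Where NAME_MAX is not a compile-time constant, the limit can be queried at run time. The sketch below uses pathconf, the pathname-based counterpart of the fpathconf function mentioned above; name_max_for is a hypothetical helper name.

```c
#include <errno.h>
#include <unistd.h>

/*
 * name_max_for: hypothetical helper, not part of any library.
 * Returns the maximum filename length for the file system that holds
 * dir. A return of -1 with errno still 0 means the limit is
 * indeterminate; -1 with errno set means pathconf failed.
 */
long
name_max_for(const char *dir)
{
    errno = 0;      /* lets the caller tell "no limit" from an error */
    return pathconf(dir, _PC_NAME_MAX);
}
```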

The DIR structure is an internal structure used by these six functions to maintain information about the directory being read. The purpose of the DIR structure is similar to that of the FILE structure maintained by the standard I/O library, which we describe in Chapter 5.

The pointer to a DIR structure that is returned by opendir is then used with the other five functions. The opendir function initializes things so that the first readdir reads the first entry in the directory. The ordering of entries within the directory is implementation dependent and is usually not alphabetical.

Example

We'll use these directory routines to write a program that traverses a file hierarchy. The goal is to produce the count of the various types of files that we show in Figure 4.4. The program shown in Figure 4.22 takes a single argument, the starting pathname, and recursively descends the hierarchy from that point. Solaris provides a function, ftw(3), that performs the actual traversal of the hierarchy, calling a user-defined function for each file. The problem with this function is that it calls the stat function for each file, which causes the program to follow symbolic links. For example, if we start at the root and have a symbolic link named /lib that points to /usr/lib, all the files in the directory /usr/lib are counted twice. To correct this, Solaris provides an additional function, nftw(3), with an option that stops it from following symbolic links. Although we could use nftw, we'll write our own simple file walker to show the use of the directory routines.

In the Single UNIX Specification, both ftw and nftw are included in the XSI extensions to the base POSIX.1 specification. Implementations are included in Solaris 9 and Linux 2.4.22. BSD-based systems have a different function, fts(3), that provides similar functionality. It is available in FreeBSD 5.2.1, Mac OS X 10.3, and Linux 2.4.22.

We have provided more generality in this program than needed. This was done to illustrate the ftw function. For example, the function myfunc always returns 0, even though the function that calls it is prepared to handle a nonzero return.

Figure 4.22. Recursively descend a directory hierarchy, counting file types

 #include "apue.h"
 #include <dirent.h>
 #include <limits.h>
 
 /* function type that is called for each filename */
 typedef int Myfunc(const char *, const struct stat *, int);
 
 static Myfunc     myfunc;
 static int        myftw(char *, Myfunc *);
 static int        dopath(Myfunc *);
 
 static long nreg, ndir, nblk, nchr, nfifo, nslink, nsock, ntot;
 
 int
 main(int argc, char *argv[])
 {
     int     ret;
 
     if (argc != 2)
         err_quit("usage: ftw <starting-pathname>");
 
     ret = myftw(argv[1], myfunc);        /* does it all */
 
     ntot = nreg + ndir + nblk + nchr + nfifo + nslink + nsock;
     if (ntot == 0)
         ntot = 1;       /* avoid divide by 0; print 0 for all counts */
     printf("regular files  = %7ld, %5.2f %%\n", nreg,
       nreg*100.0/ntot);
     printf("directories    = %7ld, %5.2f %%\n", ndir,
       ndir*100.0/ntot);
     printf("block special  = %7ld, %5.2f %%\n", nblk,
       nblk*100.0/ntot);
     printf("char special   = %7ld, %5.2f %%\n", nchr,
       nchr*100.0/ntot);
     printf("FIFOs          = %7ld, %5.2f %%\n", nfifo,
       nfifo*100.0/ntot);
     printf("symbolic links = %7ld, %5.2f %%\n", nslink,
       nslink*100.0/ntot);
     printf("sockets        = %7ld, %5.2f %%\n", nsock,
       nsock*100.0/ntot);
 
     exit(ret);
 }
 
 /*
  * Descend through the hierarchy, starting at "pathname".
  * The caller's func() is called for every file.
  */
 #define FTW_F   1       /* file other than directory */
 #define FTW_D   2       /* directory */
 #define FTW_DNR 3       /* directory that can't be read */
 #define FTW_NS  4       /* file that we can't stat */
 
 static char *fullpath;      /* contains full pathname for every file */
 
 static int                  /* we return whatever func() returns */
 myftw(char *pathname, Myfunc *func)
 {
 
     int len;
     fullpath = path_alloc(&len);    /* malloc's for PATH_MAX+1 bytes */
                                         /* (Figure 2.15) */
     strncpy(fullpath, pathname, len);       /* protect against */
     fullpath[len-1] = 0;                    /* buffer overrun */
 
     return(dopath(func));
 }
 /*
  * Descend through the hierarchy, starting at "fullpath".
  * If "fullpath" is anything other than a directory, we lstat() it,
  * call func(), and return. For a directory, we call ourself
  * recursively for each name in the directory.
  */
 static int                  /* we return whatever func() returns */
 dopath(Myfunc* func)
 {
     struct stat     statbuf;
     struct dirent   *dirp;
     DIR             *dp;
     int             ret;
     char            *ptr;
 
     if (lstat(fullpath, &statbuf) < 0) /* stat error */
         return(func(fullpath, &statbuf, FTW_NS));
     if (S_ISDIR(statbuf.st_mode) == 0) /* not a directory */
         return(func(fullpath, &statbuf, FTW_F));
 
      /*
       * It's a directory. First call func() for the directory,
       * then process each filename in the directory.
       */
     if ((ret = func(fullpath, &statbuf, FTW_D)) != 0)
         return(ret);
 
     ptr = fullpath + strlen(fullpath);      /* point to end of fullpath */
     *ptr++ = '/';
     *ptr = 0;
 
      if ((dp = opendir(fullpath)) == NULL)     /* can't read directory */
          return(func(fullpath, &statbuf, FTW_DNR));
 
      while ((dirp = readdir(dp)) != NULL) {
          if (strcmp(dirp->d_name, ".") == 0 ||
              strcmp(dirp->d_name, "..") == 0)
                  continue;        /* ignore dot and dot-dot */
 
          strcpy(ptr, dirp->d_name);   /* append name after slash */
 
          if ((ret = dopath(func)) != 0)          /* recursive */
               break; /* time to leave */
      }
      ptr[-1] = 0;    /* erase everything from slash onwards */
 
      if (closedir(dp) < 0)
          err_ret("can't close directory %s", fullpath);
 
      return(ret);
 }
 
 static int
 myfunc(const char *pathname, const struct stat *statptr, int type)
 {
     switch (type) {
     case FTW_F:
         switch (statptr->st_mode & S_IFMT) {
         case S_IFREG:    nreg++;    break;
         case S_IFBLK:    nblk++;    break;
         case S_IFCHR:    nchr++;    break;
         case S_IFIFO:    nfifo++;   break;
         case S_IFLNK:    nslink++;  break;
         case S_IFSOCK:   nsock++;   break;
         case S_IFDIR:
             err_dump("for S_IFDIR for %s", pathname);
                     /* directories should have type = FTW_D */
         }
         break;
 
     case FTW_D:
         ndir++;
         break;
 
     case FTW_DNR:
         err_ret("can't read directory %s", pathname);
         break;
 
     case FTW_NS:
         err_ret("stat error for %s", pathname);
         break;
 
     default:
         err_dump("unknown type %d for pathname %s", type, pathname);
     }
 
     return(0);
 }

For additional information on descending through a file system and the use of this technique in many standard UNIX System commands (find, ls, tar, and so on), refer to Fowler, Korn, and Vo [1989].

4.22. `chdir`, `fchdir`, and `getcwd` Functions

Every process has a current working directory. This directory is where the search for all relative pathnames starts (all pathnames that do not begin with a slash). When a user logs in to a UNIX system, the current working directory normally starts at the directory specified by the sixth field in the /etc/passwd file: the user's home directory. The current working directory is an attribute of a process; the home directory is an attribute of a login name.

We can change the current working directory of the calling process by calling the chdir or fchdir functions.

#include <unistd.h>

int chdir(const char *pathname);

int fchdir(int filedes);

Both return: 0 if OK, -1 on error

We can specify the new current working directory either as a pathname or through an open file descriptor.

The fchdir function is not part of the base POSIX.1 specification. It is an XSI extension in the Single UNIX Specification. All four platforms discussed in this book support fchdir.

Example

Because it is an attribute of a process, the current working directory cannot affect processes that invoke the process that executes the chdir. (We describe the relationship between processes in more detail in Chapter 8.) This means that the program in Figure 4.23 doesn't do what we might expect.

If we compile it and call the executable mycd, we get the following:

     $ pwd
     /usr/lib
     $ mycd
     chdir to /tmp succeeded
     $ pwd
     /usr/lib

The current working directory for the shell that executed the mycd program didn't change. This is a side effect of the way that the shell executes programs. Each program is run in a separate process, so the current working directory of the shell is unaffected by the call to chdir in the program. For this reason, the chdir function has to be called directly from the shell, so the cd command is built into the shells.

Figure 4.23. Example of `chdir` function

 #include "apue.h"
 
 int
 main(void)
 {
 
      if (chdir("/tmp") < 0)
          err_sys("chdir failed");
      printf("chdir to /tmp succeeded\n");
      exit(0);
 }
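The same effect can be observed inside a single program. In the sketch below, a child process calls chdir and exits, and the parent then confirms that its own working directory is unchanged. The name parent_cwd_unchanged is hypothetical, and the PATH_MAX fallback of 4096 is an assumption for systems whose <limits.h> does not define it. (The fork and waitpid functions are covered in Chapter 8.)

```c
#include <limits.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* assumption: fallback where <limits.h> omits it */
#endif

/*
 * parent_cwd_unchanged: hypothetical helper, not part of any library.
 * Returns 1 if the parent's working directory is the same after a
 * child has changed to dir, 0 if it differs, -1 on error.
 */
int
parent_cwd_unchanged(const char *dir)
{
    char  before[PATH_MAX], after[PATH_MAX];
    pid_t pid;
    int   status;

    if (getcwd(before, sizeof(before)) == NULL)
        return -1;

    if ((pid = fork()) < 0)
        return -1;
    if (pid == 0)                           /* child */
        _exit(chdir(dir) < 0 ? 1 : 0);      /* the change dies with the child */

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;                          /* child could not chdir */

    if (getcwd(after, sizeof(after)) == NULL)
        return -1;
    return strcmp(before, after) == 0;
}
```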

Because the kernel must maintain knowledge of the current working directory, we should be able to fetch its current value. Unfortunately, the kernel doesn't maintain the full pathname of the directory. Instead, the kernel keeps information about the directory, such as a pointer to the directory's v-node.

What we need is a function that starts at the current working directory (dot) and works its way up the directory hierarchy, using dot-dot to move up one level. At each directory, the function reads the directory entries until it finds the name that corresponds to the i-node of the directory that it just came from. Repeating this procedure until the root is encountered yields the entire absolute pathname of the current working directory. Fortunately, a function is already provided for us that does this task.

#include <unistd.h>

char *getcwd(char *buf, size_t size);

Returns: buf if OK, NULL on error

We must pass to this function the address of a buffer, buf, and its size (in bytes). The buffer must be large enough to accommodate the absolute pathname plus a terminating null byte, or an error is returned. (Recall the discussion of allocating space for a maximum-sized pathname in Section 2.5.5.)

Some older implementations of getcwd allow the first argument buf to be NULL. In this case, the function calls malloc to allocate size number of bytes dynamically. This is not part of POSIX.1 or the Single UNIX Specification and should be avoided.
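One portable way to size the buffer, without relying on a NULL first argument, is to grow it until getcwd stops failing with ERANGE. This is a sketch only: current_dir is a hypothetical name and the starting size of 64 bytes is arbitrary.

```c
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * current_dir: hypothetical helper, not part of any library.
 * Returns a malloc'ed string holding the absolute pathname of the
 * current working directory, or NULL on error. The caller frees it.
 */
char *
current_dir(void)
{
    size_t size = 64;           /* arbitrary starting guess */
    char  *buf = NULL;
    char  *bigger;

    for (;;) {
        if ((bigger = realloc(buf, size)) == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;

        if (getcwd(buf, size) != NULL)
            return buf;         /* pathname and null byte both fit */

        if (errno != ERANGE) {  /* a real failure, not a short buffer */
            free(buf);
            return NULL;
        }
        size *= 2;              /* too small: double it and retry */
    }
}
```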

Example

The program in Figure 4.24 changes to a specific directory and then calls getcwd to print the working directory. If we run the program, we get

     $ ./a.out
     cwd = /var/spool/uucppublic
     $ ls -l /usr/spool
     lrwxrwxrwx 1 root 12 Jan 31 07:57 /usr/spool -> ../var/spool

Note that chdir follows the symbolic link, as we expect it to from Figure 4.17, but when it goes up the directory tree, getcwd has no idea when it hits the /var/spool directory that it is pointed to by the symbolic link /usr/spool. This is a characteristic of symbolic links.

Figure 4.24. Example of `getcwd` function

   #include "apue.h"
 
   int
   main(void)
   {
 
       char    *ptr;
       int     size;
 
       if (chdir("/usr/spool/uucppublic") < 0)
           err_sys("chdir failed");
 
       ptr = path_alloc(&size); /* our own function */
       if (getcwd(ptr, size) == NULL)
           err_sys("getcwd failed");
 
       printf("cwd = %s\n", ptr);
       exit(0);
   }

The getcwd function is useful when we have an application that needs to return to the location in the file system where it started out. We can save the starting location by calling getcwd before we change our working directory. After we complete our processing, we can pass the pathname obtained from getcwd to chdir to return to our starting location in the file system.

The fchdir function provides us with an easy way to accomplish this task. Instead of calling getcwd, we can open the current directory and save the file descriptor before we change to a different location in the file system. When we want to return to where we started, we can simply pass the file descriptor to fchdir.
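The descriptor-based approach can be sketched as a pair of small routines. The names save_cwd and restore_cwd are hypothetical; the sketch assumes that the process has read permission for its current directory, so that opening dot succeeds.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * save_cwd: hypothetical helper, not part of any library.
 * Returns a descriptor that refers to the current working directory,
 * or -1 on error.
 */
int
save_cwd(void)
{
    return open(".", O_RDONLY);
}

/*
 * restore_cwd: hypothetical helper, not part of any library.
 * Changes back to the directory saved by save_cwd and closes the
 * descriptor. Returns 0 if OK, -1 on error.
 */
int
restore_cwd(int fd)
{
    int ret = fchdir(fd);

    close(fd);
    return ret;
}
```

A caller would invoke save_cwd, change directories as often as it likes, and then pass the saved descriptor to restore_cwd. Unlike the getcwd approach, this keeps working even if the starting directory is renamed while we are away.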

4.23. Device Special Files

The two fields st_dev and st_rdev are often confused. We'll need to use these fields in Section 18.9 when we write the ttyname function. The rules are simple.

Every file system is known by its major and minor device numbers, which are encoded in the primitive system data type dev_t. The major number identifies the device driver and sometimes encodes which peripheral board to communicate with; the minor number identifies the specific subdevice. Recall from Figure 4.13 that a disk drive often contains several file systems. Each file system on the same disk drive would usually have the same major number, but a different minor number.
We can usually access the major and minor device numbers through two macros defined by most implementations: major and minor. This means that we don't care how the two numbers are stored in a dev_t object.

Early systems stored the device number in a 16-bit integer, with 8 bits for the major number and 8 bits for the minor number. FreeBSD 5.2.1 and Mac OS X 10.3 use a 32-bit integer, with 8 bits for the major number and 24 bits for the minor number. On 32-bit systems, Solaris 9 uses a 32-bit integer for dev_t, with 14 bits designated as the major number and 18 bits designated as the minor number. On 64-bit systems, Solaris 9 represents dev_t as a 64-bit integer, with 32 bits for each number. On Linux 2.4.22, although dev_t is a 64-bit integer, currently the major and minor numbers are each only 8 bits.

POSIX.1 states that the dev_t type exists, but doesn't define what it contains or how to get at its contents. The macros major and minor are defined by most implementations. Which header they are defined in depends on the system. They can be found in <sys/types.h> on BSD-based systems. Solaris defines them in <sys/mkdev.h>. Linux defines these macros in <sys/sysmacros.h>, which is included by <sys/types.h>.
The st_dev value for every filename on a system is the device number of the file system containing that filename and its corresponding i-node.
Only character special files and block special files have an st_rdev value. This value contains the device number for the actual device.
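A common use of st_dev follows directly from these rules: two pathnames reside on the same file system exactly when their st_dev values are equal. The following is a minimal sketch, with same_filesystem as a hypothetical helper name.

```c
#include <sys/stat.h>

/*
 * same_filesystem: hypothetical helper, not part of any library.
 * Returns 1 if both pathnames reside on the same file system,
 * 0 if they do not, -1 if either stat call fails.
 */
int
same_filesystem(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
        return -1;

    /* st_dev names the containing file system, not the file itself */
    return sa.st_dev == sb.st_dev;
}
```

Note that the comparison uses st_dev for every file type; st_rdev would be meaningful only if the arguments were character special or block special files.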

Example

The program in Figure 4.25 prints the device number for each command-line argument. Additionally, if the argument refers to a character special file or a block special file, the st_rdev value for the special file is also printed.

Running this program gives us the following output:

       $ ./a.out / /home/sar /dev/tty[01]
       /: dev = 3/3
       /home/sar: dev = 3/4
       /dev/tty0: dev = 0/7 (character) rdev = 4/0
       /dev/tty1: dev = 0/7 (character) rdev = 4/1
       $ mount                      which directories are mounted on which devices?
       /dev/hda3 on / type ext2 (rw,noatime)
       /dev/hda4 on /home type ext2 (rw,noatime)
       $ ls -lL /dev/tty[01] /dev/hda[34]
       brw-------  1 root       3,   3 Dec 31  1969 /dev/hda3
       brw-------  1 root       3,   4 Dec 31  1969 /dev/hda4
       crw-------  1 root       4,   0 Dec 31  1969 /dev/tty0
       crw-------  1 root       4,   1 Jan 18 15:36 /dev/tty1

The first two arguments to the program are directories (/ and /home/sar), and the next two are the device names /dev/tty[01]. (We use the shell's filename pattern-matching syntax to shorten the amount of typing we need to do. The shell will expand the string /dev/tty[01] to /dev/tty0 /dev/tty1.)

We expect the devices to be character special files. The output from the program shows that the root directory has a different device number than does the /home/sar directory. This indicates that they are on different file systems. Running the mount(1) command verifies this.

We then use ls to look at the two disk devices reported by mount and the two terminal devices. The two disk devices are block special files, and the two terminal devices are character special files. (Normally, the only types of devices that are block special files are those that can contain random-access file systems: disk drives, floppy disk drives, and CD-ROMs, for example. Some older versions of the UNIX System supported magnetic tapes for file systems, but this was never widely used.)

Note that the filenames and i-nodes for the two terminal devices (st_dev) are on device 0/7 (the devfs pseudo file system, which implements /dev) but that their actual device numbers are 4/0 and 4/1.

Figure 4.25. Print `st_dev` and `st_rdev` values

 #include "apue.h"
 #ifdef SOLARIS
 #include <sys/mkdev.h>
 #endif
 
 int
 main(int argc, char *argv[])
 {
 
     int         i;
     struct stat buf;
 
     for (i = 1; i < argc; i++) {
         printf("%s: ", argv[i]);
         if (stat(argv[i], &buf) < 0) {
             err_ret("stat error");
             continue;
          }
 
          printf("dev = %d/%d", major(buf.st_dev), minor(buf.st_dev));
          if (S_ISCHR(buf.st_mode) || S_ISBLK(buf.st_mode)) {
              printf(" (%s) rdev = %d/%d",
                      (S_ISCHR(buf.st_mode)) ? "character" : "block",
                      major(buf.st_rdev), minor(buf.st_rdev));
 
          }
          printf("\n");
     }
 
     exit(0);
 
 }

4.24. Summary of File Access Permission Bits

We've covered all the file access permission bits, some of which serve multiple purposes. Figure 4.26 summarizes all these permission bits and their interpretation when applied to a regular file and a directory.

Figure 4.26. Summary of file access permission bits

| Constant | Description | Effect on regular file | Effect on directory |
|---|---|---|---|
| S_ISUID | set-user-ID | set effective user ID on execution | (not used) |
| S_ISGID | set-group-ID | if group-execute set then set effective group ID on execution; otherwise enable mandatory record locking (if supported) | set group ID of new files created in directory to group ID of directory |
| S_ISVTX | sticky bit | control caching of file contents (if supported) | restrict removal and renaming of files in directory |
| S_IRUSR | user-read | user permission to read file | user permission to read directory entries |
| S_IWUSR | user-write | user permission to write file | user permission to remove and create files in directory |
| S_IXUSR | user-execute | user permission to execute file | user permission to search for given pathname in directory |
| S_IRGRP | group-read | group permission to read file | group permission to read directory entries |
| S_IWGRP | group-write | group permission to write file | group permission to remove and create files in directory |
| S_IXGRP | group-execute | group permission to execute file | group permission to search for given pathname in directory |
| S_IROTH | other-read | other permission to read file | other permission to read directory entries |
| S_IWOTH | other-write | other permission to write file | other permission to remove and create files in directory |
| S_IXOTH | other-execute | other permission to execute file | other permission to search for given pathname in directory |

The final nine constants can also be grouped into threes, since

       S_IRWXU = S_IRUSR | S_IWUSR | S_IXUSR
       S_IRWXG = S_IRGRP | S_IWGRP | S_IXGRP
       S_IRWXO = S_IROTH | S_IWOTH | S_IXOTH
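To illustrate how the nine constants are tested one at a time, the sketch below turns a mode into the nine-character string that ls -l displays. The name mode_string is hypothetical, and for simplicity the sketch ignores the set-user-ID, set-group-ID, and sticky bits.

```c
#include <sys/stat.h>
#include <sys/types.h>

/*
 * mode_string: hypothetical helper, not part of any library.
 * Fills out with nine permission characters (user, group, other)
 * followed by a null byte.
 */
void
mode_string(mode_t mode, char out[10])
{
    static const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,
        S_IRGRP, S_IWGRP, S_IXGRP,
        S_IROTH, S_IWOTH, S_IXOTH
    };
    static const char letters[] = "rwxrwxrwx";
    int i;

    for (i = 0; i < 9; i++)
        out[i] = (mode & bits[i]) ? letters[i] : '-';
    out[9] = '\0';
}
```

For example, passing S_IRWXU | S_IRGRP | S_IXGRP | S_IROTH | S_IXOTH produces the string rwxr-xr-x.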
