
Steve Pate: UNIX Filesystems: Evolution, Design, and Implementation

This book describes the filesystems found in all versions of UNIX and Linux. The author covers file I/O programming and describes the internals of various UNIX versions and of popular filesystems such as UFS, ext2, and VERITAS VxFS. The book includes examples that you can experiment with.

File Concepts

To get a complete picture of the filesystem, it is necessary to understand the main concepts.

This chapter explains those basic concepts. Programmers new to UNIX will find much that is useful here. The implementation of the well-known ls utility, together with the library functions and system calls it relies on, is examined in detail.

In UNIX, almost everything is represented as a file, and most operations reduce to file operations. A directory can be opened and read in much the same way as a regular file.

UNIX File Types

There are two main file types: regular files and directories. Regular files include text files, documents, executable files, and so on.

Directories organize the filesystem into a hierarchical structure.

The full list of file types is as follows:

Regular files. Such files store data of various kinds, and the filesystem does not interpret their contents in any special way.

Directories. Directories give the filesystem its structure. A directory may index the files it contains in an arbitrary order.

Symbolic links. A symbolic link, also called a symlink, allows one file to refer to another file under a different name. Removing a symlink has no effect on the file it refers to.

Hard links. A hard link differs from a symlink in that the file keeps a link count, which is incremented by one each time a hard link to the file is created and decremented by one each time a link is removed. When the count reaches zero, the file itself is removed.

Named pipes. A named pipe is a bi-directional IPC (Inter Process Communication) mechanism that allows unrelated processes to communicate. It differs from ordinary UNIX pipes, which can be accessed only by related processes.

Special files. A special file refers to a device such as a disk. To access the device, a process opens the special file.

Xenix special files. Semaphores and shared memory segments in the Xenix operating system could be managed from UNIX. A special file of zero length can represent a semaphore or a shared memory segment.

The properties of a file of any type can be obtained with the stat() system call, which the ls utility calls implicitly.

File Descriptors

Consider a few examples in C, starting with the following program:
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <unistd.h>
 
 int main(void)
 {
     int fd;
 
     fd = open("/etc/passwd", O_RDONLY);
     printf("fd = %d\n", fd);
     close(fd);
     return 0;
 }
 
 
 $ make open
 cc open.c -o open
 $ ./open
 fd = 3
 
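A file must first be opened with open(), which requires the three header files included above. The manual page gives its prototype and description:
 int open(const char *path, int oflag, ...);
 
 DESCRIPTION
 The open() function establishes the connection between a
 file and a file descriptor. It creates an ..
 
 
If the call succeeds, it returns a file descriptor, stored here in the variable fd, which is then passed to other system calls such as read(), write(), and lseek(). The value is typically 3 because descriptors 0, 1, and 2 are already in use for standard input, standard output, and standard error.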

Basic File Properties

The ls -l command displays the following properties of each file:

 File type and access permissions
 Number of links to the file
 Owner and group
 File size
 Date of last modification
 File name
 
The ls command does the following:
 1. Reads the entries of the current directory
 2. Displays the properties of each file
 
To do this, ls uses the getdents() system call to read the directory entries and calls stat() for each file:
 #include <sys/types.h>
 #include <sys/stat.h>
 
 int stat(const char *path, struct stat *buf);
 
The first character of each line of ls -l output indicates the file type:
 - regular file
 d directory
 l symbolic link
 p named pipe
 c character special file
 b block special file
 
The stat() system call fills in the following structure:
 
 struct stat {
 	dev_t st_dev; /* ID of device containing file */ 
 	ino_t st_ino; /* Inode number / file serial number */ 
 	mode_t st_mode; /* File mode */ 
 	nlink_t st_nlink; /* Number of links to file */ 
 	uid_t st_uid; /* User ID of file */ 
 	gid_t st_gid; /* Group ID of file */ 
 	dev_t st_rdev; /* Device ID for char/blk special file */ 
 	off_t st_size; /* File size in bytes (regular file) */ 
 	time_t st_atime; /* Time of last access */ 
 	time_t st_mtime; /* Time of last data modification */ 
 	time_t st_ctime; /* Time of last status change */ 
 	long st_blksize; /* Preferred I/O block size */ 
 	blkcnt_t st_blocks; /* Number of 512 byte blocks allocated */ 
 }; 
 
An example implementation of the ls command is shown below:
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/dirent.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <stdio.h>
 #include <string.h>
 #include <time.h>
 #include <pwd.h>
 #include <grp.h>
 
 #define BUFSZ 1024
 
 /* getdents() and <sys/dirent.h> follow the author's system; */
 /* their availability varies between UNIX versions */
 
 void perms(mode_t mode);   /* prints the type and permissions; not shown */
 
 int main(void)
 {
     struct dirent *dir;
     struct stat   st;
     struct passwd *pw;
     struct group  *grp;
     char          buf[BUFSZ], *bp, *ftime;
     int           dfd, nread;
 
     dfd = open(".", O_RDONLY);
     memset(buf, 0, BUFSZ);
     /* each call returns as many directory entries as fit in buf */
     while ((nread = getdents(dfd, (struct dirent *)buf, BUFSZ)) > 0) {
         bp = buf;
         while (bp < buf + nread) {
             dir = (struct dirent *)bp;
             if (stat(dir->d_name, &st) == 0) {
                 ftime = ctime(&st.st_mtime);
                 ftime[16] = '\0';      /* cut off seconds and year */
                 ftime += 4;            /* skip the day of the week */
                 pw = getpwuid(st.st_uid);
                 grp = getgrgid(st.st_gid);
                 perms(st.st_mode);
                 printf("%3d %-8s %-7s %9ld %s %s\n",
                        (int)st.st_nlink,
                        pw ? pw->pw_name : "?",
                        grp ? grp->gr_name : "?",
                        (long)st.st_size, ftime, dir->d_name);
             }
             bp += dir->d_reclen;       /* advance to the next entry */
         }
         memset(buf, 0, BUFSZ);
     }
     close(dfd);
     return 0;
 }
 

The loop calls getdents() repeatedly until every directory entry has been read; each call returns as many entries as fit in the buffer. It is worth testing the program on a directory that contains a large number of files.

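In addition to stat(), two further system calls return the same information:

 #include <sys/types.h>
 #include <sys/stat.h>
 
 int lstat(const char *path, struct stat *buf);
 
 int fstat(int fildes, struct stat *buf);
 
The difference between stat() and lstat() lies in how they treat symbolic links: stat() follows the link and reports on the file it points to, whereas lstat() reports on the link itself. fstat() operates on a file that is already open, identified by its file descriptor.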
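
To make the difference concrete, here is a small sketch (the helper name is_symlink() is hypothetical). It has to use lstat(), because stat() follows the link and therefore never reports a symbolic link type.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return 1 if path itself is a symbolic link,
 * 0 if it is some other kind of file, and -1 if lstat() fails.
 */
int is_symlink(const char *path)
{
    struct stat st;

    if (lstat(path, &st) < 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}
```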

The File Mode Creation Mask

Consider creating a file with the touch command:
 $ touch myfile
 $ ls -l myfile
 -rw-r--r-- 1 spate fcf 0 Feb 16 11:14 myfile
 
A zero-length file is created.

The file is assigned the user ID and group ID of the process that created it. It is readable and writable (rw-) by its owner and read-only (r--) for members of the group fcf and for all other users.

These default permissions are controlled by the file mode creation mask, which can be displayed and changed with the umask command.

The mask can be displayed in numeric or symbolic form:

 $ umask
 022
 $ umask -S
 u=rwx,g=rx,o=rx
 
To change the mask, umask is called with three octal digits, which represent the owner, the group, and all others. Each digit is the sum of the values for read (r=4), write (w=2), and execute (x=1); a permission named in the mask is removed from newly created files.

The mask is typically 022 by default. The following shows a file being created by touch under that mask:

 $ umask
 022
 $ strace touch myfile 2>&1 | grep open | grep myfile
 open("myfile",
 O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) = 3
 $ ls -l myfile
 -rw-r--r-- 1 spate fcf 0 Apr 4 09:45 myfile
 
The mask 022 removes write permission for the group and for others. touch asks for the file to be created with mode 0666, and the kernel clears the bits that are set in the mask: 0666 & ~022 = 0644, which gives the permissions -rw-r--r--.
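
The calculation is a bit mask rather than a subtraction, which the following sketch makes explicit (the function name effective_mode() is hypothetical). With a mask of 0111, for instance, a requested mode of 0666 is left unchanged, whereas subtracting would give the wrong answer.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: the permission bits a new file ends up with when
 * it is created with mode `requested` under the creation mask `mask`.
 */
mode_t effective_mode(mode_t requested, mode_t mask)
{
    return requested & ~mask & 0777;
}
```

For the touch example, effective_mode(0666, 022) evaluates to 0644.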

Changing File Permissions

Several commands can change the attributes of a file. The best known is chmod:
 chmod [ -fR ] absolute-mode file ...
 chmod [ -fR ] symbolic-mode-list file ...
 
The permissions rwxr--r-- are equivalent to octal 744. To set them, chmod is called with the following arguments:
 $ ls -l myfile
 -rw------- 1 spate fcf 0 Mar 6 10:09 myfile
 $ chmod 744 myfile
 $ ls -l myfile
 -rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
 
Or, equivalently:
 
 $ ls -l myfile
 -rw------- 1 spate fcf 0 Mar 6 10:09 myfile
 $ chmod u+x,a+r myfile
 $ ls -l myfile
 -rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
 
In the symbolic form, the letters u, g, o, and a select the user (owner), group, others, or all three. A permission is then added (+), removed (-), or set exactly (=):
 $ ls -l myfile
 -rw------- 1 spate fcf 0 Mar 6 10:09 myfile
 $ chmod u=rwx,g=r,o=r myfile
 $ ls -l myfile
 -rwxr--r-- 1 spate fcf 0 Mar 6 10:09 myfile*
 
The -R option applies the change recursively to a directory and its contents:
 $ ls -ld mydir
 drwxr-xr-x 2 spate fcf 4096 Mar 30 11:06 mydir/
 $ ls -l mydir
 total 0
 -rw-r--r-- 1 spate fcf 0 Mar 30 11:06 fileA
 -rw-r--r-- 1 spate fcf 0 Mar 30 11:06 fileB
 $ chmod -R a+w mydir
 $ ls -ld mydir
 drwxrwxrwx 2 spate fcf 4096 Mar 30 11:06 mydir/
 $ ls -l mydir
 total 0
 -rw-rw-rw- 1 spate fcf 0 Mar 30 11:06 fileA
 -rw-rw-rw- 1 spate fcf 0 Mar 30 11:06 fileB
 
Example using find and xargs:
 $ find mydir -print | xargs chmod a+
 
The chmod command is built on two system calls:
 #include <sys/types.h>
 #include <sys/stat.h>
 
 int chmod(const char *path, mode_t mode);
 
 int fchmod(int fildes, mode_t mode);
 
The mode argument is a bitwise OR of the fields shown in Table 2.1. Some of the flags can be combined as shown below:
 
 	S_IRWXU. This is the bitwise OR of S_IRUSR, S_IWUSR and S_IXUSR 
 	S_IRWXG. This is the bitwise OR of S_IRGRP, S_IWGRP and S_IXGRP 
 	S_IRWXO. This is the bitwise OR of S_IROTH, S_IWOTH and S_IXOTH 
 

Changing File Ownership

When a file is created, the user and group IDs are set to those of the caller. Occasionally it is useful to change ownership of a file or change the group in which the file resides. Only the root user can change the ownership of a file, although any user can change the file's group ID to another group in which the user resides.

There are three calls that can be used to change the file's user and group, as shown below:

 #include <sys/types.h>
 #include <unistd.h>
 
 int chown(const char *path, uid_t owner, gid_t group);
 int fchown(int fd, uid_t owner, gid_t group);
 int lchown(const char *path, uid_t owner, gid_t group);
 
The difference between chown() and lchown() is that the lchown() system call operates on the symbolic link specified rather than the file to which it points.
 
 
 
 Table 2.1 Permissions Passed to chmod() 
 
 PERMISSION DESCRIPTION 
 
 S_IRWXU Read, write, execute/search by owner 
 S_IRUSR Read permission by owner 
 S_IWUSR Write permission by owner 
 S_IXUSR Execute/search permission by owner 
 S_IRWXG Read, write, execute/search by group 
 S_IRGRP Read permission by group 
 S_IWGRP Write permission by group 
 S_IXGRP Execute/search permission by group 
 S_IRWXO Read, write, execute/search by others 
 S_IROTH Read permission by others 
 S_IWOTH Write permission by others 
 S_IXOTH Execute/search permission by others 
 S_ISUID Set-user-ID on execution 
 S_ISGID Set-group-ID on execution 
 S_ISVTX On directories, set the restricted deletion flag 
 
In addition to setting the user and group IDs of the file, it is also possible to set the effective user and effective group IDs such that if the file is executed, the caller effectively becomes the owner of the file for the duration of execution. This is a commonly used feature in UNIX. For example, the passwd command is a setuid binary. When the command is executed it must gain an effective user ID of root in order to change the passwd(F) file. For example:
 
 $ ls -l /etc/passwd
 -r--r--r-- 1 root other 157670 Mar 14 16:03 /etc/passwd
 $ ls -l /usr/bin/passwd
 -r-sr-sr-x 3 root sys 99640 Oct 6 1998 /usr/bin/passwd*
 
Because the passwd file is not writable by others, changing it requires that the passwd command run as root as noted by the s shown above. When run, the process runs as root allowing the passwd file to be changed.

The setuid() and setgid() system calls enable the user and group IDs to be changed. Similarly, the seteuid() and setegid() system calls enable the effective user and effective group ID to be changed:

 
 #include <unistd.h>
 
 int setuid(uid_t uid);
 int seteuid(uid_t euid);
 int setgid(gid_t gid);
 int setegid(gid_t egid);
 
Handling permissions checking is a task performed by the kernel.

Changing File Times

When a file is created, there are three timestamps associated with the file as shown in the stat structure earlier. These are the time that the file was last accessed, the time of last data modification, and the time of last status change (st_ctime, which is often mistaken for a creation time).

On occasion it is useful to change the access and modification times. One particular use is in a programming environment where a programmer wishes to force re-compilation of a module. The usual way to achieve this is to run the touch command on the file and then recompile. For example:

 $ ls -l hello*
 -rwxr-xr-x 1 spate fcf 13397 Mar 30 11:53 hello*
 -rw-r--r-- 1 spate fcf 31 Mar 30 11:52 hello.c
 $ make hello
 make: 'hello' is up to date.
 $ touch hello.c
 $ ls -l hello.c
 -rw-r--r-- 1 spate fcf 31 Mar 30 11:55 hello.c
 $ make hello
 cc hello.c -o hello
 
The system calls utime() and utimes() can be used to change both the access and modification times. In some versions of UNIX, utimes() is simply implemented by calling utime().
 
 #include <sys/types.h>
 #include <utime.h>
 
 int utime(const char *filename, struct utimbuf *buf);
 
 #include <sys/time.h>
 
 int utimes(char *filename, struct timeval *tvp);
 
 struct utimbuf {
     time_t actime;  /* access time */
     time_t modtime; /* modification time */
 };
 
 struct timeval {
     long tv_sec;  /* seconds */
     long tv_usec; /* microseconds */
 };
 
By running strace, truss etc., it is possible to see how a call to touch maps onto the utime() system call as follows:
 
 $ strace touch myfile 2>&1 | grep utime
 utime("myfile", NULL) = 0
 
To change just the access time of the file, the touch command must first determine what the modification time of the file is. In this case, the call sequence is a little different as the following example shows:
 
 $ strace touch -a myfile
 ...
 time([984680824]) = 984680824
 open("myfile",
 O_WRONLY|O_NONBLOCK|O_CREAT|O_NOCTTY|O_LARGEFILE, 0666) = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
 close(3) = 0
 utime("myfile", [2001/03/15-10:27:04, 2001/03/15-10:26:23]) = 0
 
In this case, the current time is obtained through calling time(). The file is then opened and fstat() called to obtain the file's modification time. The call to utime() then passes the original modification time and the new access time.

Truncating and Removing Files

Removing files is something that people just take for granted in the same vein as pulling up an editor and creating a new file. However, the internal operation of truncating and removing files can be a particularly complicated operation as later chapters will show. There are two calls that can be invoked to truncate a file:
 
 #include <unistd.h>
 
 int truncate(const char *path, off_t length);
 int ftruncate(int fildes, off_t length);
 
The confusing aspect of truncation is that, through the calls shown here, it is possible to truncate upwards, thus increasing the size of the file. If the value of length is less than the current size of the file, the file size is changed and storage above the new size can be freed. If the value of length is greater than the current size, the file size is increased to length and the new region reads back as zeros; storage may be allocated for it at this point or, on many filesystems, left unallocated as a hole.

To remove a file, the unlink() system call can be invoked:

 #include <unistd.h>
 
 int unlink(const char *path);
 
The call is appropriately named since it does not necessarily remove the file but decrements the file's link count. If the link count reaches zero, the file is indeed removed as the following example shows:
 
 $ touch myfile
 $ ls -l myfile
 -rw-r--r-- 1 spate fcf 0 Mar 15 11:09 myfile
 $ ln myfile myfile2
 $ ls -l myfile*
 -rw-r--r-- 2 spate fcf 0 Mar 15 11:09 myfile
 -rw-r--r-- 2 spate fcf 0 Mar 15 11:09 myfile2
 $ rm myfile
 $ ls -l myfile*
 -rw-r--r-- 1 spate fcf 0 Mar 15 11:09 myfile2
 $ rm myfile2
 $ ls -l myfile*
 ls: myfile*: No such file or directory
 
When myfile is created it has a link count of 1. Creation of the hard link (myfile2) increases the link count. In this case there are two directory entries (myfile and myfile2), but they point to the same file. To remove myfile, the unlink() system call is invoked, which decrements the link count and removes the directory entry for myfile.

Directories

There are a number of routines that relate to directories. As with other simple UNIX commands, they often have a close correspondence to the system calls that they call, as shown in Table 2.2.

The arguments passed to most directory operations are dependent on where in the file hierarchy the caller is at the time of the call, together with the pathname passed to the command:

Current working directory. This is where the calling process is at the time of the call; it can be obtained through use of pwd from the shell or getcwd() from within a C program.

Absolute pathname. An absolute pathname is one that starts with the character /. Thus to get to the base filename, the full pathname starting at / must be parsed. The pathname /etc/passwd is absolute.

Relative pathname. A relative pathname does not contain / as the first character and starts from the current working directory. For example, to reach the same passwd file by specifying passwd the current working directory must be /etc.

 
 Table 2.2 Directory Related Operations 
 
 COMMAND 	SYSTEM CALL 		DESCRIPTION 
 mkdir 		mkdir() 		Make a new directory 
 rmdir 		rmdir() 		Remove a directory 
 pwd 		getcwd() 		Display the current working directory 
 cd			chdir() 		Change directory 
 			fchdir() 
 chroot 		chroot() 		Change the root directory 
 
The following example shows how these calls can be used together:
 
 $ cat dir.c
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <sys/param.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <stdio.h>
 
 int main(void)
 {
     printf("cwd = %s\n", getcwd(NULL, MAXPATHLEN));
     mkdir("mydir", S_IRWXU);
     chdir("mydir");
     printf("cwd = %s\n", getcwd(NULL, MAXPATHLEN));
     chdir("..");
     rmdir("mydir");
     return 0;
 }
 
 $ make dir
 cc -o dir dir.c
 $ ./dir
 cwd = /h/h065/spate/tmp
 cwd = /h/h065/spate/tmp/mydir
 

Special Files

A special file is a file that has no associated storage but can be used to gain access to a device. The goal here is to be able to access a device using the same mechanisms by which regular files and directories can be accessed. Thus, callers are able to invoke open(), read(), and write() in the same way that these system calls can be used on regular files. One noticeable difference between special files and other file types can be seen by issuing an ls command as follows:
 
 $ ls -l /dev/vx/*dsk/homedg/h
 brw------- 1 root root 142,4002 Jun 5 1999 /dev/vx/dsk/homedg/h
 crw------- 1 root root 142,4002 Dec 5 21:48 /dev/vx/rdsk/homedg/h
 
In this example there are two device files denoted by the b and c as the first character displayed on each line. This letter indicates the type of device that this file represents. Block devices are represented by the letter b while character devices are represented by the letter c. For block devices, data is accessed in fixed-size blocks while for character devices data can be accessed in multiple different sized blocks ranging from a single character upwards. Device special files are created with the mknod command as follows:
 
 mknod name b major minor
 mknod name c major minor
 
For example, to create the above two files, execute the following commands:
 
 # mknod /dev/vx/dsk/homedg/h b 142 4002
 # mknod /dev/vx/rdsk/homedg/h c 142 4002
 
The major number is used to point to the device driver that controls the device, while the minor number is a private field used by the device driver. The mknod command is built on top of the mknod() system call:
 
 #include <sys/stat.h>
 
 int mknod(const char *path, mode_t mode, dev_t dev);
 
The mode argument specifies the type of file to be created, which can be one of the following:
 
 S_IFIFO. FIFO special file (named pipe). 
 
 S_IFCHR. Character special file. 
 
 S_IFDIR. Directory file. 
 
 S_IFBLK. Block special file. 
 
 S_IFREG. Regular file.
 
The file access permissions are also passed in through the mode argument. The permissions are constructed from a bitwise OR for which the values are the same as for the chmod() system call as outlined in the section Changing File Permissions earlier in this chapter.

Symbolic Links and Hard Links

Symbolic links and hard links can be created using the ln command, which in turn maps onto the link() and symlink() system calls. Both prototypes are shown below:
 
 #include <unistd.h>
 
 int link(const char *existing, const char *new);
 int symlink(const char *name1, const char *name2);
 
The section Truncating and Removing Files earlier in this chapter described hard links and showed the effects that link() and unlink() have on the underlying file. Symbolic links are managed in a very different manner by the filesystem as the following example shows:
 
 $ echo "Hello world" > myfile
 $ ls -l myfile
 -rw-r--r-- 1 spate fcf 12 Mar 15 12:17 myfile
 $ cat myfile
 Hello world
 $ strace ln -s myfile mysymlink 2>&1 | grep link
 execve("/bin/ln", ["ln", "-s", "myfile",
 "mysymlink"], [/* 39 vars */]) = 0
 lstat("mysymlink", 0xbffff660) = -1 ENOENT (No such file/directory)
 symlink("myfile", "mysymlink") = 0
 $ ls -l my*
 -rw-r--r-- 1 spate fcf 12 Mar 15 12:17 myfile
 lrwxrwxrwx 1 spate fcf 6 Mar 15 12:18 mysymlink -> myfile
 $ cat mysymlink
 Hello world
 $ rm myfile
 $ cat mysymlink
 cat: mysymlink: No such file or directory
 
The ln command checks to see if a file called mysymlink already exists and then calls symlink() to create the symbolic link. There are two things to notice here. First of all, after the symbolic link is created, the link count of myfile does not change. Secondly, the size of mysymlink is 6 bytes, which is the length of the string myfile. Because creating a symbolic link does not change the file it points to in any way, after myfile is removed, mysymlink does not point to anything as the example shows.

Named Pipes

Although Inter Process Communication is beyond the scope of a book on filesystems, since named pipes are stored in the filesystem as a separate file type, they should be given some mention here. A named pipe is a means by which unrelated processes can communicate. A simple example will show how this all works:
 
 $ mkfifo mypipe
 $ ls -l mypipe
 prw-r--r-- 1 spate fcf 0 Mar 13 11:29 mypipe
 $ echo "Hello world" > mypipe &
 [1] 2010
 $ cat < mypipe
 Hello world
 [1]+ Done echo "Hello world" >mypipe
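
A named pipe can also be created from a C program. The sketch below uses the POSIX mkfifo() library function, declared in <sys/stat.h>; the wrapper name make_pipe() is hypothetical, and the permissions requested are chosen to match the ls output above.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: create a named pipe at path with permissions
 * rw-r--r-- (subject to the umask). Returns 0 on success, -1 on error.
 */
int make_pipe(const char *path)
{
    return mkfifo(path, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
}
```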
 
The mkfifo command makes use of the mknod() system call. The filesystem records the fact that the file is a named pipe. However, it has no storage associated with it and, other than responding to an open request, the filesystem plays no role in the IPC mechanisms of the pipe. Pipes themselves traditionally used storage in the filesystem for temporarily storing the data.

Summary

It is difficult to provide an introductory chapter on file-based concepts without digging into too much detail. The chapter provided many of the basic functions available to view files, return their properties and change these properties. To better understand how the main UNIX commands are implemented and how they interact with the filesystem, the GNU fileutils package provides excellent documentation, which can be found online at:
 	www.gnu.org/manual/fileutils/html_mono/fileutils.html
 
and the source for these utilities can be found at:
 	ftp://alpha.gnu.org/gnu/fetish
 

CHAPTER 3

User File I/O

Building on the principles introduced in the last chapter, this chapter describes the major file-related programmatic interfaces (at a C level) including basic file access system calls, memory mapped files, asynchronous I/O, and sparse files.

To reinforce the material, examples are provided wherever possible. Such examples include simple implementations of various UNIX commands including cat, cp, and dd.

The previous chapter described many of the basic file concepts. This chapter goes one step further and describes the different interfaces that can be called to access files. Most of the APIs described here are at the system call level. Library calls typically map directly to system calls so are not addressed in any detail here.

The material presented here is important for understanding the overall implementation of filesystems in UNIX. By understanding the user-level interfaces that need to be supported, the implementation of filesystems within the kernel is easier to grasp.

Library Functions versus System Calls

System calls are functions that transfer control from the user process to the operating system kernel. Functions such as read() and write() are system calls. The process invokes them with the appropriate arguments, control transfers to the kernel where the system call is executed, results are passed back to the calling process, and finally, control is passed back to the user process.

Library functions typically provide a richer set of features. For example, the fread() library function reads a number of elements of data of specified size from a file. While presenting this formatted data to the user, internally it will call the read() system call to actually read data from the file.

Library functions are implemented on top of system calls. The decision whether to use system calls or library functions is largely dependent on the application being written. Applications wishing to have much more control over how they perform I/O in order to optimize for performance may well invoke system calls directly. If an application writer wishes to use many of the features that are available at the library level, this could save a fair amount of programming effort. System calls can consume more time than invoking library functions because they involve transferring control of the process from user mode to kernel mode. However, the implementation of different library functions may not meet the needs of the particular application. In other words, whether to use library functions or system calls is not an obvious choice because it very much depends on the application being written.

Which Header Files to Use?

The UNIX header files are an excellent source of information to understand user-level programming and also kernel-level data structures. Most of the header files that are needed for user level programming can be found under /usr/include and /usr/include/sys.

The header files that are needed are shown in the manual page of the library function or system call to be used. For example, using the stat() system call requires the following two header files:

 #include <sys/types.h>
 #include <sys/stat.h>
 
 int stat(const char *path, struct stat *buf);
 
The stat.h header file defines the stat structure. The types.h header file defines the types of each of the fields in the stat structure.

Header files that reside in /usr/include are used purely by applications. Those header files that reside in /usr/include/sys are also used by the kernel. Using stat() as an example, a reference to the stat structure is passed from the user process to the kernel, the kernel fills in the fields of the structure and then returns. Thus, in many circumstances, both user processes and the kernel need to understand the same structures and data types.

The Six Basic File Operations

Most file creation and file I/O needs can be met by the six basic system calls shown in Table 3.1. This section uses these calls to show a basic implementation of the UNIX cat command, which is one of the easiest of the UNIX commands to implement.

However, before giving its implementation, it is necessary to describe the terms standard input, standard output, and standard error. As described in the section File Descriptors in Chapter 2, the first file that is opened by a user process is assigned a file descriptor value of 3. When the new process is created, it typically inherits the first three file descriptors from its parent. These file descriptors (0, 1, and 2) have a special meaning to routines in the C runtime library and refer to the standard input, standard output, and standard error of the process respectively. When using library routines, a file stream is specified that determines where data is to be read from or written to. Some functions such as printf() write to standard output by default. For other routines such as fprintf(), the file stream must be specified. For standard output, stdout may be used and for standard error, stderr may be used. Similarly, when using routines that require an input stream, stdin may be used. Chapter 5 describes the implementation of the standard I/O library. For now simply consider them as a layer on top of file descriptors.

When directly invoking system calls, which require file descriptors, the constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO may be used. These values are defined in unistd.h as follows:

 
 #define STDIN_FILENO 0 
 #define STDOUT_FILENO 1 
 #define STDERR_FILENO 2 
 
Looking at the implementation of the cat command, the program must be able to use standard input, output, and error to handle invocations such as:
 
 $ cat # read from standard input
 $ cat file # read from 'file'
 $ cat file > file2 # redirect standard output
 
Thus there is a small amount of parsing to be performed before the program knows which file to read from and which file to write to. The program source is shown below:
 
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <unistd.h>
 
 #define BUFSZ 512
 
 /* assigns the input and output descriptors; not shown here */
 void get_fds(int argc, char **argv, int *ifd, int *ofd);
 
 int main(int argc, char **argv)
 {
     char buf[BUFSZ];
     int  ifd, ofd, nread;
 
     get_fds(argc, argv, &ifd, &ofd);
     /* copy until end of file (0) or a read error (-1) */
     while ((nread = read(ifd, buf, BUFSZ)) > 0) {
         write(ofd, buf, nread);
     }
     return 0;
 }
 
 Table 3.1 The Six Basic System Calls Needed for File I/O 
 
 SYSTEM CALL 	FUNCTION 
 open() 			Open an existing file or create a new file 
 creat() 		Create a new file 
 close() 		Close an already open file 
 lseek()			Seek to a specified position in the file 
 read()			Read data from the file from the current position 
 write()			Write data starting at the current position 
 
 
As previously mentioned, there is actually very little work to do in the main program. The get_fds() function, which is not shown here, is responsible for assigning the appropriate file descriptors to ifd and ofd based on the following input:
 
 $ mycat
 ifd = STDIN_FILENO
 ofd = STDOUT_FILENO
 
 
 $ mycat file
 ifd = open(file, O_RDONLY)
 ofd = STDOUT_FILENO
 
 
 $ mycat > file
 ifd = STDIN_FILENO
 ofd = open(file, O_WRONLY | O_CREAT)
 
 
 $ mycat fileA > fileB
 ifd = open(fileA, O_RDONLY)
 ofd = open(fileB, O_WRONLY | O_CREAT)
 
 
 The following examples show the program running: 
 
 $ mycat > testfile
 Hello world
 $ mycat testfile
 Hello world
 $ mycat testfile > testfile2
 
 
 
 $ mycat testfile2
 Hello world
 $ mycat
 Hello
 Hello
 world
 world
 
To modify the program, one exercise to try is to implement the get_fds() function. Some additional exercises to try are:
 1. Number all output lines (cat -n). Parse the input strings to detect the -n. 
 2. Print all tabs as ^I and place a $ character at the end of each line (cat -ET). 
 
The previous program reads the whole file and writes out its contents. Commands such as dd allow the caller to seek to a specified block in the input file and output a specified number of blocks. Reading sequentially from the start of the file in order to get to the part which the user specified would be particularly inefficient. The lseek() system call allows the file pointer to be modified, thus allowing random access to the file. The declaration for lseek() is as follows:
 
 #include <sys/types.h>
 #include <unistd.h>
 
 off_t lseek(int fildes, off_t offset, int whence);
 
The offset and whence arguments dictate where the file pointer should be positioned:
 If whence is SEEK_SET the file pointer is set to offset bytes.
 If whence is SEEK_CUR the file pointer is set to its current location plus offset.
 If whence is SEEK_END the file pointer is set to the size of the file plus offset.
 
When a file is first opened, the file pointer is set to 0, indicating that the first byte read will be at an offset of 0 bytes from the start of the file. Each time data is read, the file pointer is incremented by the amount of data read such that the next read will start from the offset in the file referenced by the updated pointer. For example, if the first read of a file is for 1024 bytes, the file pointer for the next read will be set to 0 + 1024 = 1024. Reading another 1024 bytes will start from byte offset 1024. After that read the file pointer will be set to 1024 + 1024 = 2048 and so on.
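
The way the three whence values and ordinary reads move the file pointer can be checked with a small sketch such as the one below. The helper names current_offset() and make_scratch_file() are hypothetical and exist only for this illustration; the scratch file is created under /tmp, which is an assumption about the system.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: report the current file pointer without moving it. */
off_t current_offset(int fd)
{
    return lseek(fd, 0, SEEK_CUR);
}

/* Hypothetical helper: create an unlinked scratch file containing "0123456789". */
int make_scratch_file(void)
{
    char path[] = "/tmp/seekdemoXXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* the file goes away at the last close */
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    return fd;                          /* file pointer is now at offset 10 */
}
```

After make_scratch_file() the pointer sits at 10 because of the write. Calling lseek(fd, 2, SEEK_SET) and then reading 3 bytes leaves it at 5, and lseek(fd, -1, SEEK_END) positions it at 9, the last byte of the file.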

By seeking throughout the input and output files, it is possible to see how the dd command can be implemented. As with many UNIX commands, most of the work is done in parsing the command line to determine the input and output files, the starting position to read, the block size for reading, and so on. The example below shows how lseek() is used to seek to a specified starting offset within the input file. In this example, all data read is written to standard output:

 
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
     char *buf;
     int fd;
     ssize_t nread;
     off_t offset;
     size_t iosize;

     if (argc != 4) {
         printf("usage: mydd filename offset size\n");
         exit(1);
     }
     fd = open(argv[1], O_RDONLY);
     if (fd < 0) {
         printf("unable to open file\n");
         exit(1);
     }
     offset = (off_t)atol(argv[2]);
     iosize = (size_t)atol(argv[3]);    /* number of bytes to read */
     buf = malloc(iosize);
     if (buf == NULL)
         exit(1);
     lseek(fd, offset, SEEK_SET);       /* move to the requested starting offset */
     nread = read(fd, buf, iosize);
     if (nread > 0)
         write(STDOUT_FILENO, buf, nread);
     return 0;
 }
 
Using a large file as an example, try different offsets and sizes and determine the effect on performance. Also try multiple runs of the program. Some of the effects seen may not be as expected. The section Data and Attribute Caching, a bit later in this chapter, discusses some of these effects.

Duplicate File Descriptors

The section File Descriptors, in Chapter 2, introduced the concept of file descriptors. Typically a file descriptor is returned in response to an open() or creat() system call. The dup() system call allows a user to duplicate an existing open file descriptor.
 
 #include <unistd.h>
 
 int dup(int fildes);
 
There are a number of uses for dup() that are really beyond the scope of this book. However, the shell often uses dup() when connecting the input and output streams of processes via pipes.
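
One property that makes dup() useful for redirection is that the new descriptor refers to the same open file as the original, so both share a single file pointer. The sketch below checks this; dup_shares_offset() is a hypothetical helper written for this illustration only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a descriptor created by dup() shares
 * its file pointer with the original, 0 if not, -1 on error.
 * fd must refer to a seekable file that is open for writing.
 */
int dup_shares_offset(int fd)
{
    off_t before, after;
    int copy = dup(fd);

    if (copy < 0) {
        perror("dup");
        return -1;
    }
    before = lseek(copy, 0, SEEK_CUR);
    if (before < 0 || write(fd, "x", 1) != 1) {   /* write through the original */
        close(copy);
        return -1;
    }
    after = lseek(copy, 0, SEEK_CUR);             /* offset seen through the copy */
    close(copy);
    return after == before + 1;
}
```

Writing one byte through the original descriptor advances the offset observed through the duplicate, which is why output written by a child process lands in the right place after the shell has rearranged its descriptors.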

Seeking and I/O Combined

The pread() and pwrite() system calls combine the effects of lseek() and read() (or write()) into a single system call. This provides some improvement in performance, although the net effect will only really be visible in an application that has a very I/O intensive workload. Both interfaces are supported by the Single UNIX Specification and should be accessible in most UNIX environments. The definition of these interfaces is as follows:
 
 #include <unistd.h>
 
 ssize_t pread(int fildes, void *buf, size_t nbyte, off_t offset);
 ssize_t pwrite(int fildes, const void *buf, size_t nbyte,
                off_t offset);
 
The example below continues on from the dd program described earlier and shows how pread() and pwrite() replace the separate lseek() and read() or write() calls:
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
     char *buf;
     int ifd, ofd;
     ssize_t nread;
     off_t inoffset, outoffset;
     size_t insize, outsize;

     if (argc != 7) {
         printf("usage: mydd infilename in_offset"
                " in_size outfilename out_offset"
                " out_size\n");
         exit(1);
     }
     ifd = open(argv[1], O_RDONLY);
     if (ifd < 0) {
         printf("unable to open %s\n", argv[1]);
         exit(1);
     }
     ofd = open(argv[4], O_WRONLY);
     if (ofd < 0) {
         printf("unable to open %s\n", argv[4]);
         exit(1);
     }
     inoffset = (off_t)atol(argv[2]);
     insize = (size_t)atol(argv[3]);
     outoffset = (off_t)atol(argv[5]);
     outsize = (size_t)atol(argv[6]);
     buf = malloc(insize);
     if (buf == NULL)
         exit(1);
     if (insize < outsize)
         outsize = insize;

     /* Read and write at explicit offsets; no lseek() calls are needed. */
     nread = pread(ifd, buf, insize, inoffset);
     if (nread > 0)
         pwrite(ofd, buf,
                ((size_t)nread < outsize) ? (size_t)nread : outsize, outoffset);
     return 0;
 }
 
The simple example below shows how the program is run:
 
 $ cat fileA
 0123456789
 $ cat fileB
 
 
 $ mydd2 fileA 2 4 fileB 4 3
 $ cat fileA
 0123456789
 $ cat fileB
 ----234-
 
To indicate how performance may be improved through the use of pread() and pwrite(), the I/O loop was repeated 1 million times in both this program and the earlier lseek() based example, and time() was called to determine how many seconds each loop took to execute.

For the pread()/pwrite() combination the average time to complete the I/O loop was 25 seconds while for the lseek()/read() and lseek()/write() combinations the average time was 35 seconds, which shows a considerable difference.

This test shows the advantage of pread() and pwrite() in its best form. In general though, if an lseek() is immediately followed by a read() or write(), the two calls should be combined.
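
The two approaches can be put side by side as follows. This is a minimal sketch with hypothetical function names; besides saving a system call, the pread() version leaves the file pointer where it was, which the lseek() version does not.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: position the file pointer, then read from it. */
ssize_t read_with_seek(int fd, void *buf, size_t n, off_t off)
{
    if (lseek(fd, off, SEEK_SET) < 0) {
        perror("lseek");
        return -1;
    }
    return read(fd, buf, n);
}

/* One system call: read at an explicit offset; the file pointer is not moved. */
ssize_t read_with_pread(int fd, void *buf, size_t n, off_t off)
{
    return pread(fd, buf, n, off);
}
```

Both functions return the same data for the same arguments, so replacing one with the other is safe as long as the caller does not depend on the file pointer having moved.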

Data and Attribute Caching

There are a number of flags that can be passed to open() that control various aspects of the I/O. Also, some filesystems support additional but non standard methods for improving I/O performance.

Firstly, there are three options, supported under the Single UNIX Specification, that can be passed to open() and that have an impact on subsequent I/O operations. When a write takes place, there are two items of data that must be written to disk, namely the file data and the file's inode. An inode is the object stored on disk that describes the file, including the properties seen by calling stat() together with a block map of all data blocks associated with the file.

The three options that are supported from a standards perspective are:

1. O_SYNC. For all types of writes, whether allocation is required or not, the data and any meta-data updates are committed to disk before the write returns. For reads, the access time stamp will be updated before the read returns.

2. O_DSYNC. When a write occurs, the data will be committed to disk before the write returns but the file's meta-data may not be written to disk at this stage. This will result in better I/O throughput because, if implemented efficiently by the filesystem, the number of inode updates will be minimized, effectively halving the number of writes. Typically, if the write results in an allocation to the file (a write over a hole or beyond the end of the file) the meta-data is also written to disk. However, if the write does not involve an allocation, the timestamps will typically not be written synchronously.

3. O_RSYNC. If both the O_RSYNC and O_DSYNC flags are set, the read returns after the data has been read and the file attributes have been updated on disk, with the exception of file timestamps that may be written later. If there are any writes pending that cover the range of data to be read, these writes are committed before the read returns.

If both the O_RSYNC and O_SYNC flags are set, the behavior is identical to that of setting O_RSYNC and O_DSYNC except that all file attributes changed by the read operation (including all time attributes) must also be committed to disk before the read returns. Which option to choose is dependent on the application. For I/O intensive applications where timestamp updates are not particularly important, there can be a significant performance boost by using O_DSYNC in place of O_SYNC.

VxFS Caching Advisories

Some filesystems provide non standard means of improving I/O performance by offering additional features. For example, the VERITAS filesystem, VxFS, provides the noatime mount option that disables access time updates; this is usually fine for most application environments.

The following example shows the effect that selecting O_SYNC versus O_DSYNC can have on an application:

 #include <sys/types.h>
 #include <fcntl.h>
 #include <unistd.h>

 int main(int argc, char *argv[])
 {
     char buf[4096] = { 0 };
     int i, fd;

     /* Each write() returns only after the data is on disk. */
     fd = open("myfile", O_WRONLY|O_DSYNC);
     for (i = 0; i < 1024; i++)
         write(fd, buf, 4096);
     return 0;
 }
 
By having a program that is identical to the previous with the exception of setting O_SYNC in place of O_DSYNC, the output of the two programs is as follows:
 
 # time ./sync
 real 0m8.33s
 user 0m0.03s
 sys 0m1.92s
 # time ./dsync
 real 0m6.44s
 user 0m0.02s
 sys 0m0.69s
 
This clearly shows the increase in time when selecting O_SYNC. VxFS offers a number of other advisories that go beyond what is currently supported by the traditional UNIX standards. These options can only be accessed through use of the ioctl() system call. These advisories give an application writer more control over a number of I/O parameters:
VX_RANDOM. Filesystems try to determine the I/O pattern in order to perform read ahead to maximize performance. This advisory indicates that the I/O pattern is random and therefore read ahead should not be performed.
VX_SEQ. This advisory indicates that the file is being accessed sequentially. In this case the filesystem should maximize read ahead.
VX_DIRECT. When data is transferred between the user buffer and disk, a copy is first made into the kernel buffer or page cache, which is a cache of recently accessed file data. Although this cache can significantly help performance by avoiding a read of data from disk for a second access, the double copying of data has an impact on performance. The VX_DIRECT advisory avoids this double buffering by copying data directly between the user's buffer and disk.
VX_NOREUSE. If data is only to be read once, the in-kernel cache is not needed. This advisory informs the filesystem that the data does not need to be retained for subsequent access.
VX_DSYNC. This option was in existence for a number of years before the O_DSYNC mode was adopted by the UNIX standards committees. It can still be accessed on platforms where O_DSYNC is not supported.

Before showing how these caching advisories can be used it is first necessary to describe how to use the ioctl() system call. The definition of ioctl(), which is not part of any UNIX standard, differs slightly from platform to platform by requiring different header files. The basic definition is as follows:

 #include <unistd.h>      /* Solaris */
 #include <stropts.h>     /* Solaris, AIX and HP-UX */
 #include <sys/ioctl.h>   /* Linux */
 
 int ioctl(int fildes, int request, /* arg ... */);
 
Note that AIX does not, at the time of writing, support ioctl() calls on regular files. Ioctl calls may be made to VxFS regular files, but the operation is not supported generally.

The following program shows how the caching advisories are used in practice. The program takes VX_SEQ, VX_RANDOM, or VX_DIRECT as an argument and reads a 1MB file in 4096 byte chunks.

 #include <sys/types.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>
 #include "sys/fs/vx_ioctl.h"

 #define MB (1024 * 1024)

 int main(int argc, char *argv[])
 {
     char *buf;
     int i, fd, advisory;
     long pagesize, pagemask;

     if (argc != 2)
         exit(1);
     if (strcmp(argv[1], "VX_SEQ") == 0) {
         advisory = VX_SEQ;
     } else if (strcmp(argv[1], "VX_RANDOM") == 0) {
         advisory = VX_RANDOM;
     } else if (strcmp(argv[1], "VX_DIRECT") == 0) {
         advisory = VX_DIRECT;
     } else {
         exit(1);
     }

     /* Allocate two pages and round up so that buf is page aligned. */
     pagesize = sysconf(_SC_PAGESIZE);
     pagemask = pagesize - 1;
     buf = (char *)malloc(2 * pagesize);
     buf = (char *)(((long)buf + pagesize) & ~pagemask);

     fd = open("myfile", O_RDWR);
     ioctl(fd, VX_SETCACHE, advisory);
     for (i = 0; i < MB; i++) {
         read(fd, buf, 4096);
     }
     return 0;
 }
 
The program was run three times, passing each of the advisories in turn. The times command was run to display the time to run the program and the amount of time that was spent in user and system space.
 
 VX_SEQ
 
 real 2:47.6
 user 5.9
 sys 2:41.4
 
 VX_DIRECT
 
 real 2:35.7
 user 6.7
 sys 2:28.7
 
 VX_RANDOM
 
 real 2:43.6
 user 5.2
 sys 2:38.1
 
Although the time difference between the runs shown here is not significant, the appropriate use of these caching advisories can have a significant impact on overall performance of large applications.

Miscellaneous Open Options

Through use of the O_NONBLOCK and O_NDELAY flags that can be passed to open(), applications can gain some additional control in the case where they may block for reads and writes.
O_EXCL. If both O_CREAT and O_EXCL are set, a call to open() fails if the file exists. If the O_CREAT option is not set, the effect of passing O_EXCL is undefined.
O_NONBLOCK / O_NDELAY. These flags can affect subsequent reads and writes. If both the O_NDELAY and O_NONBLOCK flags are set, O_NONBLOCK takes precedence. Because both options are for use with pipes, they won't be discussed further here.
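
A common use of the O_CREAT and O_EXCL combination is to make sure that only one process creates a given file. The following is a minimal sketch; the helper name create_exclusive() and the 0600 permissions are choices made for this illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: create path only if it does not already exist.
 * Returns a descriptor on success, or -1 with errno set (EEXIST when
 * the file is already there).
 */
int create_exclusive(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

The first call for a given path succeeds. A second call, whether from the same process or another one, fails with errno set to EEXIST until the file is removed.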

File and Record Locking

If multiple processes are writing to a file at the same time, the result is non deterministic. Within the UNIX kernel, only one write to the same file may proceed at any given time. However, if multiple processes are writing to the file, the order in which they run can differ depending on many different factors. Obviously this is highly undesirable and results in a need to lock files at an application level, whether the whole file or specific sections of a file. Sections of a file are also called records, hence file and record locking.

There are numerous uses for file locking. However, looking at database file access gives an excellent example of the types of locks that applications require. For example, it is important that all users wishing to view database records are able to do so simultaneously. When updating records it is imperative that while one record is being updated, other users are still able to access other records. Finally it is imperative that records are updated in a time-ordered manner.

There are two types of locks that can be used to coordinate access to files, namely mandatory and advisory locks. With advisory locking, it is possible for cooperating processes to safely access a file in a controlled manner. Mandatory locking is somewhat of a hack and will be described later. The majority of this section will concentrate on advisory locking, sometimes called record locking.

Advisory Locking

There are three functions which can be used for advisory locking. These are lockf(), flock(), and fcntl(). The flock() function, defined below:
 
 /usr/ucb/cc [ flag ... ] file ...
 #include <sys/file.h>
 
 int flock(fd, operation)
 int fd, operation;
 
was introduced in BSD UNIX and is not supported under the Single UNIX Specification standard. It sets an advisory lock on the whole file. The lock type, specified by the operation argument, may be exclusive (LOCK_EX) or shared (LOCK_SH). If operation is ORed with LOCK_NB and the file is already locked, EAGAIN is returned instead of blocking. The LOCK_UN operation removes the lock.

The lockf() function, which is typically implemented as a call to fcntl(), can be invoked to apply or remove an advisory lock on a segment of a file as follows:

 #include <unistd.h>
 
 int lockf(int fildes, int function, off_t size);
 
To use lockf(), the file must have been opened with one of the O_WRONLY or O_RDWR flags. The size argument specifies the number of bytes to be locked, starting from the current file pointer. Thus, a call to lseek() should be made prior to calling lockf(). If the value of size is 0 the file is locked from the current offset to the end of the file.

The function argument can be one of the following:
F_LOCK. This command sets an exclusive lock on the file. If the file is already locked, the calling process will block until the previous lock is relinquished.
F_TLOCK. This performs the same function as the F_LOCK command but will not block; thus if the file is already locked, EAGAIN is returned.
F_ULOCK. This command unlocks a segment of the file.
F_TEST. This command is used to test whether a lock exists for the specified segment. If there is no lock for the segment, 0 is returned, otherwise -1 is returned, and errno is set to EACCES.

If the segment to be locked contains a previously locked segment, in whole or part, the result will be a new, single locked segment. Similarly, if F_ULOCK is specified, the segment of the file to be unlocked may be a subset of a previously locked segment or may cover more than one previously locked segment. If size is 0, the file is unlocked from the current file offset to the end of the file. If the segment to be unlocked is a subset of a previously locked segment, the result will be one or two smaller locked segments.

It is possible to reach deadlock if two processes make a request to lock segments of a file owned by each other. The kernel is able to detect this and, if the condition would occur, EDEADLK is returned.
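
Because lockf() works from the current file pointer, locking a given segment always pairs an lseek() with the lockf() call. A minimal sketch, using the hypothetical helpers lock_segment() and unlock_segment():

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: lock size bytes starting at start, blocking until
 * the lock is granted. The file must be open for writing.
 */
int lock_segment(int fd, off_t start, off_t size)
{
    if (lseek(fd, start, SEEK_SET) < 0) {   /* lockf() starts at the file pointer */
        perror("lseek");
        return -1;
    }
    return lockf(fd, F_LOCK, size);
}

/* Hypothetical helper: release a segment locked with lock_segment(). */
int unlock_segment(int fd, off_t start, off_t size)
{
    if (lseek(fd, start, SEEK_SET) < 0) {
        perror("lseek");
        return -1;
    }
    return lockf(fd, F_ULOCK, size);
}
```

Note that both helpers move the file pointer, so a caller that is also reading or writing sequentially needs to save and restore its position, or use pread() and pwrite().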

Note, as mentioned above, that lockf() is typically implemented on top of the fcntl() system call, for which there are three commands that can be passed to manage record locking. Recall the interface for fcntl():

 #include <sys/types.h>
 #include <unistd.h>
 #include <fcntl.h>
 
 int fcntl(int fildes, int cmd, ...);
 
All commands operate on the flock structure that is passed as the third argument:
 struct flock {
 	short l_type; /* F_RDLCK, F_WRLCK or F_UNLCK */ 
 	short l_whence; /* flag for starting offset */ 
 	off_t l_start; /* relative offset in bytes */ 
 	off_t l_len; /* size; if 0 then until EOF */ 
 	pid_t l_pid; /* process ID of lock holder */ 
 }; 
 
The commands that can be passed to fcntl() are:
F_GETLK. This command returns the first lock that is covered by the flock structure specified. The information that is retrieved overwrites the fields of the structure passed.
F_SETLK. This command either sets a new lock or clears an existing lock based on the value of l_type as shown above.
F_SETLKW. This command is the same as F_SETLK with the exception that the process will block if the lock is held by another process.

Because record locking as defined by fcntl() is supported by all appropriate UNIX standards, this is the routine that should be ideally used for application portability.
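
The interaction between F_SETLK and F_GETLK can be seen within a single program by letting a child process query a lock held by its parent. The sketch below is illustrative only; all function names in it are hypothetical. It relies on the fact that record locks are not inherited across fork(), so the child sees the parent's lock as a conflicting one.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: describe a lock of the given type over the whole file. */
static void whole_file(struct flock *lk, short l_type)
{
    memset(lk, 0, sizeof(*lk));
    lk->l_type = l_type;
    lk->l_whence = SEEK_SET;
    lk->l_start = 0;
    lk->l_len = 0;                      /* 0 means "until end of file" */
}

/* Hypothetical helper: set or clear (F_UNLCK) a whole-file lock without blocking. */
int lock_whole_file(int fd, short l_type)
{
    struct flock lk;

    whole_file(&lk, l_type);
    return fcntl(fd, F_SETLK, &lk);
}

/*
 * Hypothetical check: the parent takes a write lock, then a child asks
 * F_GETLK whether a write lock of its own would be blocked.
 * Returns 1 if the child saw the parent as the lock holder, 0 if not,
 * -1 on error.
 */
int child_sees_parent_lock(int fd)
{
    struct flock lk;
    int status;
    pid_t child;

    if (lock_whole_file(fd, F_WRLCK) < 0)
        return -1;
    child = fork();
    if (child < 0) {
        perror("fork");
        return -1;
    }
    if (child == 0) {
        whole_file(&lk, F_WRLCK);
        if (fcntl(fd, F_GETLK, &lk) < 0)
            _exit(2);
        /* l_type stays F_WRLCK and l_pid names the holder if a lock conflicts. */
        _exit((lk.l_type != F_UNLCK && lk.l_pid == getppid()) ? 0 : 1);
    }
    if (waitpid(child, &status, 0) < 0)
        return -1;
    lock_whole_file(fd, F_UNLCK);
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

The descriptor passed in must be open for writing, since a write lock is requested.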

The following code fragments show how advisory locking works in practice. The first program, lock, which follows, sets a writable lock on the whole of the file myfile and calls pause() to wait for a SIGUSR1 signal. After the signal arrives, a call is made to unlock the file.


 #include <sys/types.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <stdio.h>

 void
 mysig(int signo)
 {
     return;
 }

 int main(void)
 {
     struct flock lk;
     int fd, err;

     sigset(SIGUSR1, mysig);     /* install a handler so that pause() returns */

     fd = open("myfile", O_WRONLY);

     lk.l_type = F_WRLCK;        /* write lock ...           */
     lk.l_whence = SEEK_SET;     /* ... from the start ...   */
     lk.l_start = 0;
     lk.l_len = 0;               /* ... to the end of file   */
     lk.l_pid = getpid();

     err = fcntl(fd, F_SETLK, &lk);
     printf("lock: File is locked\n");
     pause();                    /* wait for SIGUSR1 */
     lk.l_type = F_UNLCK;
     err = fcntl(fd, F_SETLK, &lk);
     printf("lock: File is unlocked\n");
     return 0;
 }
 
Note that the process ID of this process is placed in l_pid so that anyone requesting information about the lock will be able to determine how to identify this process.

The next program (mycatl) is a modified version of the cat program that will only display the file if there are no write locks held on the file. If a lock is detected, the program loops up to 5 times waiting for the lock to be released. Because the lock will still be held by the lock program, mycatl will extract the process ID from the flock structure returned by fcntl() and post a SIGUSR1 signal. This is handled by the lock program which then unlocks the file.

 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>

 void catfile(int fd);    /* not shown here; assumed to copy the file to standard output */

 /* Return the process ID of a lock holder, or 0 if a read lock could be taken. */
 pid_t
 is_locked(int fd)
 {
     struct flock lk;

     lk.l_type = F_RDLCK;
     lk.l_whence = SEEK_SET;
     lk.l_start = 0;
     lk.l_len = 0;
     lk.l_pid = 0;

     fcntl(fd, F_GETLK, &lk);
     return (lk.l_type == F_UNLCK) ? 0 : lk.l_pid;
 }

 int main(void)
 {
     int i, fd;
     pid_t pid;

     fd = open("myfile", O_RDONLY);

     for (i = 0; i < 5; i++) {
         if ((pid = is_locked(fd)) == 0) {
             catfile(fd);
             exit(0);
         } else {
             printf("mycatl: File is locked ...\n");
             sleep(1);
         }
     }

     kill(pid, SIGUSR1);     /* ask the lock holder to release the lock */
     while ((pid = is_locked(fd)) != 0) {
         printf("mycatl: Waiting for lock release\n");
         sleep(1);
     }
     catfile(fd);
     return 0;
 }
 
Note the use of fcntl() in the mycatl program. If no lock exists on the file that would interfere with the lock requested (in this case the program is asking for a read lock on the whole file), the l_type field is set to F_UNLCK. When the program is run, the following can be seen:
 
 $ cat myfile
 Hello world
 $ lock&
 
 [1] 2448
 lock: File is locked
 $ mycatl
 mycatl: File is locked ..
 mycatl: File is locked ..
 mycatl: File is locked ..
 mycatl: File is locked ..
 mycatl: File is locked ..
 mycatl: Waiting for lock release
 
 
 
 lock: File is unlocked
 Hello world
 [1]+ Exit 23 ./lock
 
The following example shows where advisory locking fails to become effective if processes are not cooperating:
 $ lock&
 
 [1] 2494
 lock: File is locked
 $ cat myfile
 Hello world
 $ rm myfile
 $ jobs
 [1]+ Running ./lock 
 
In this case, although the file has a segment lock, a non-cooperating process can still access the file, thus the real cat program can display the file and the file can also be removed! Note that removing a file involves calling the unlink() system call. The file is not actually removed until the last close. In this case the lock program still has the file open. The file will actually be removed once the lock program exits.
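
The behavior of unlink() on an open file can be observed directly. In the sketch below, which uses a hypothetical function name and assumes a writable /tmp, the name disappears from the directory at once but the data remains reachable through the open descriptor.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical demonstration: returns 1 if a file's data is still readable
 * through an open descriptor after its name has been unlinked, 0 if not,
 * -1 on error.
 */
int readable_after_unlink(void)
{
    char path[] = "/tmp/unlinkdemoXXXXXX";
    char c = 0;
    int ok, fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (write(fd, "A", 1) != 1 || unlink(path) < 0) {
        close(fd);
        return -1;
    }
    /* The name is gone, but the open descriptor still reaches the data. */
    ok = access(path, F_OK) < 0 && errno == ENOENT
         && pread(fd, &c, 1, 0) == 1 && c == 'A';
    close(fd);                  /* last close: the storage is released here */
    return ok;
}
```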

Mandatory Locking

As the previous example shows, if all processes accessing the same file do not cooperate through the use of advisory locks, unpredictable results can occur. Mandatory locking provides file locking between non-cooperating processes. Unfortunately, the implementation, which arrived with SVR3, leaves something to be desired.

Mandatory locking can be enabled on a file if the set group ID bit is switched on and the group execute bit is switched off, a combination that together does not otherwise make any sense. Thus if the following were executed on a system that supports mandatory locking:

 $ lock&
 [1] 12096
 lock: File is locked
 $ cat myfile # The cat program blocks here
 
the cat program will block until the lock is relinquished. Note that mandatory locking is not supported by the major UNIX standards so further details will not be described here.
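
The permission combination that marks a file for mandatory locking can be computed from its existing mode bits. This is a small sketch with a hypothetical helper name; whether the resulting mode actually enables mandatory locking depends on the platform and filesystem.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: given a file's permission bits, return the bits
 * with set-group-ID switched on and group execute switched off.
 * The result can be applied with chmod().
 */
mode_t mandatory_lock_mode(mode_t mode)
{
    return (mode | S_ISGID) & ~(mode_t)S_IXGRP;
}
```

For example, a file with mode 0750 would be given mode 02740.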

File Control Operations

The fcntl() system call is designed to provide file control functions for open files. The definition was shown in a previous section, File and Record Locking, earlier in the chapter. It is repeated below:
 #include <sys/types.h>
 #include <unistd.h>
 #include <fcntl.h>
 
 int fcntl(int fildes, int cmd, ...);
 
The file descriptor refers to a previously opened file and the cmd argument is one of the commands shown below:
F_DUPFD. This command returns a new file descriptor that is the lowest numbered file descriptor available (and is not already open). The file descriptor returned will be greater than or equal to the third argument. The new file descriptor refers to the same open file as the original file descriptor and shares any locks. The FD_CLOEXEC (see F_SETFD below) flag associated with the new file descriptor is cleared to keep the file open across calls to one of the exec functions.
F_GETFD. This command returns the flags associated with the specified file descriptor. This is a little bit of a misnomer because there has only ever been one flag, the FD_CLOEXEC flag that indicates that the file should be closed following a successful call to exec().
F_SETFD. This command sets the FD_CLOEXEC flag.
F_GETFL. This command returns the file status flags and file access modes for fildes. The file access modes can be extracted from the return value using the mask O_ACCMODE. The access modes are O_RDONLY, O_WRONLY, and O_RDWR. The file status flags, as described in the sections Data and Attribute Caching and Miscellaneous Open Options, earlier in this chapter, can be any of O_APPEND, O_SYNC, O_DSYNC, O_RSYNC, or O_NONBLOCK.
F_SETFL. This command sets the file status flags for the specified file descriptor.
F_GETLK. This command retrieves information about an advisory lock. See the section File and Record Locking, earlier in this chapter, for further information.
F_SETLK. This command clears or sets an advisory lock. See the section File and Record Locking, earlier in this chapter, for further information.
F_SETLKW. This command also clears or sets an advisory lock. See the section File and Record Locking, earlier in this chapter, for further information.
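
As a brief illustration of the non-locking commands, the sketch below extracts the access mode with O_ACCMODE and sets the FD_CLOEXEC flag while preserving any other descriptor flags. The helper names are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: return O_RDONLY, O_WRONLY or O_RDWR for fd, or -1. */
int access_mode(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    return (flags < 0) ? -1 : (flags & O_ACCMODE);
}

/* Hypothetical helper: mark fd to be closed by a successful exec(). */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

A descriptor obtained afterwards with fcntl(fd, F_DUPFD, 10) is numbered 10 or higher and has FD_CLOEXEC cleared, even if the original has it set.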

Vectored Reads and Writes

If the data that a process reads from a file in a single read needs to be placed in different areas of memory, this would typically involve more than one call to read(). However, the readv() system call can be used to perform a single read from the file but copy the data to multiple memory locations, which can cut down on system call overhead and therefore increase performance in environments where there is a lot of I/O activity. When writing to files the writev() system call can be used.

Here are the definitions for both functions:

 #include <sys/uio.h>
 
 ssize_t readv(int fildes, const struct iovec *iov, int iovcnt);
 ssize_t writev(int fildes, const struct iovec *iov, int iovcnt);
 
Note that although multiple I/Os can be combined, they must all be contiguous within the file. Each buffer is described by an iovec structure:
 struct iovec {
     void   *iov_base;  /* Address in memory of buffer for r/w */
     size_t  iov_len;   /* Size of the above buffer in memory */
 };
 
Figure 3.1 shows how the transfer of data occurs for a read operation. The shading on the areas of the file and the address space show where the data will be placed after the read has completed.

The following program corresponds to the example shown in Figure 3.1:

 #include <sys/uio.h>
 #include <unistd.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>

 int main(void)
 {
     struct iovec uiop[3];
     void *addr1, *addr2, *addr3;
     int fd, nbytes;

     addr1 = malloc(4096);
     addr2 = malloc(4096);
     addr3 = malloc(4096);

     /* One read of 2048 contiguous bytes is scattered into three buffers. */
     uiop[0].iov_base = addr1; uiop[0].iov_len = 512;
     uiop[1].iov_base = addr2; uiop[1].iov_len = 512;
     uiop[2].iov_base = addr3; uiop[2].iov_len = 1024;

     fd = open("myfile", O_RDONLY);
     nbytes = readv(fd, uiop, 3);
     printf("number of bytes read = %d\n", nbytes);
     return 0;
 }
 
Note that readv() returns the number of bytes read. When this program runs, the result is 2048 bytes, the total number of bytes obtained by adding the three individual iovec structures.
 $ readv
 number of bytes read = 2048
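
The writing side works the same way. The following sketch, in which write_record() is a hypothetical helper, gathers a header and a body held in separate buffers into one writev() call:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write a header and a body, held in separate
 * buffers, with one system call. Returns the total bytes written or -1.
 */
ssize_t write_record(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;   /* writev() only reads from the buffers */
    iov[0].iov_len = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);
    return writev(fd, iov, 2);
}
```

The data arrives in the file, or on the pipe, in the order of the iovec entries, exactly as if the two buffers had first been copied into one.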
 

Asynchronous I/O

By issuing an I/O asynchronously, an application can continue with other work rather than waiting for the I/O to complete. There have been numerous different implementations of asynchronous I/O (commonly referred to as async I/O) over the years. This section will describe the interfaces as supported by the Single UNIX Specification.

As an example of where async I/O is commonly used, consider the Oracle database writer process (DBWR), one of the main Oracle processes; its role is to manage the Oracle buffer cache, a user-level cache of database blocks. This involves responding to read requests and writing dirty (modified) buffers to disk.

In an active database, the work of DBWR is complicated by the fact that it is constantly writing dirty buffers to disk in order to allow new blocks to be read. Oracle employs two methods to help alleviate some of the performance bottlenecks. First, it supports multiple DBWR processes (called DBWR slave processes); the second option, which greatly improves throughput, is through use of async I/O. If I/O operations are being performed asynchronously, the DBWR processes can be doing other work, whether flushing more buffers to disk, reading data from disk, or other internal functions.

All of the Single UNIX Specification async I/O operations center around an I/O control block defined by the aiocb structure as follows:


 struct aiocb { 
 int aio_fildes; /* file descriptor */ 
 off_t aio_offset; /* file offset */ 
 volatile void *aio_buf; /* location of buffer */ 
 size_t aio_nbytes; /* length of transfer */ 
 int aio_reqprio; /* request priority offset */ 
 struct sigevent aio_sigevent; /* signal number and value */ 
 int aio_lio_opcode; /* operation to be performed */ 
 }; 
 
The fields of the aiocb structure will be described throughout this section as the various interfaces are described. The first interface to describe is aio_read():
 cc [ flag... ] file... -lrt [ library... ]
 #include <aio.h>
 int aio_read(struct aiocb *aiocbp);
 
The aio_read() function will read aiocbp->aio_nbytes bytes from the file associated with file descriptor aiocbp->aio_fildes into the buffer referenced by aiocbp->aio_buf. The call returns when the I/O has been initiated. Note that the requested operation takes place at the offset in the file specified by the aio_offset field.

Similarly, to perform an asynchronous write operation, the function to call is aio_write(), which is defined as follows:

 cc [ flag... ] file... -lrt [ library... ]
 #include <aio.h>
 
 int aio_write(struct aiocb *aiocbp);
 
and the fields in the aio control block used to initiate the write are the same as for an async read.

In order to retrieve the status of a pending I/O, there are two interfaces that can be used. One involves the posting of a signal and will be described later; the other involves the use of the aio_return() function as follows:

 #include <aio.h>
 
 ssize_t aio_return(struct aiocb *aiocbp);
 
The aio control block that was passed to aio_read() should be passed to aio_return(). The result will either be the same as if a call to read() or write() had been made or, if the operation is still in progress, the result is undefined.
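
Putting these calls together, a complete asynchronous read might look like the sketch below. It is a minimal illustration rather than a recommended pattern: async_read_wait() is a hypothetical helper, it waits with aio_suspend() (a POSIX call not otherwise covered in this section) instead of doing useful work in the meantime, and on some systems the program must be linked with -lrt as shown in the synopsis above.

```c
#define _XOPEN_SOURCE 700
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: start an asynchronous read, wait for it to finish,
 * then collect the result. A real application would do other work between
 * aio_read() and the wait.
 * Returns the byte count, or -1 with errno set.
 */
ssize_t async_read_wait(int fd, void *buf, size_t n, off_t off)
{
    struct aiocb cb;
    const struct aiocb *list[1];
    int err;

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = n;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;   /* no signal on completion */

    if (aio_read(&cb) < 0) {
        perror("aio_read");
        return -1;
    }
    list[0] = &cb;
    while ((err = aio_error(&cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);              /* sleep until the request completes */

    if (err != 0) {
        aio_return(&cb);                         /* release the request's resources */
        errno = err;
        return -1;
    }
    return aio_return(&cb);                      /* same value read() would have returned */
}
```

Note that aio_return() is only called once aio_error() no longer reports EINPROGRESS, which avoids the undefined result described above.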

The following example shows some interesting properties of an asynchronous write:

 #include <aio.h>
 #include <time.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <unistd.h>

 #define FILESZ (1024 * 1024 * 64)

 int main(void)
 {
     struct aiocb aio;
     void *buf;
     time_t time1, time2;
     int err;

     buf = malloc(FILESZ);
     memset(&aio, 0, sizeof(aio));   /* clear the fields that are not set below */
     aio.aio_fildes = open("/dev/vx/rdsk/fs1", O_WRONLY);
     aio.aio_buf = buf;
     aio.aio_offset = 0;
     aio.aio_nbytes = FILESZ;
     aio.aio_reqprio = 0;

     time(&time1);
     err = aio_write(&aio);          /* returns once the I/O has been initiated */
     while ((err = aio_error(&aio)) == EINPROGRESS) {
         sleep(1);
     }
     time(&time2);
     printf("The I/O took %ld seconds\n", (long)(time2 - time1));
     return 0;
 }
 
The program uses the raw device /dev/vx/rdsk/fs1 to write a single 64MB buffer. The aio_error() call:
 
 cc [ flag... ] file... -lrt [ library... ]
 #include <aio.h>
 int aio_error(const struct aiocb *aiocbp);
 
can be called to determine whether the I/O has completed, is still in progress, or whether an error occurred. The return value from aio_error() will either correspond to the return value from read(), write(), or will be EINPROGRESS if the I/O is still pending. Note when the program is run:
 # aiowrite
 The I/O took 7 seconds
 
Thus if the process had issued a write through use of the write() system call, it would wait for 7 seconds before being able to do anything else. Through the use of async I/O the process is able to continue processing and then find out the status of the async I/O at a later time.

For async I/O operations that are still pending, the aio_cancel() function can be used to cancel the operation:

 cc [ flag... ] file... -lrt [ library... ]
 #include <aio.h>
 
 int aio_cancel(int fildes, struct aiocb *aiocbp);
 
The fildes argument refers to the open file on which a previously made async I/O, as specified by aiocbp, was issued. If aiocbp is NULL, all pending async I/O operations on fildes are canceled. Note that it is not always possible to cancel an async I/O. In many cases, the I/O will be queued at the driver level before the call from aio_read() or aio_write() returns.

As an example, following the above call to aio_write(), this code is inserted:

 /* errstr is assumed to be declared in main() as: char *errstr; */
 err = aio_cancel(aio.aio_fildes, &aio);

 switch (err) {
 case AIO_CANCELED:
     errstr = "AIO_CANCELED";
     break;
 case AIO_NOTCANCELED:
     errstr = "AIO_NOTCANCELED";
     break;
 case AIO_ALLDONE:
     errstr = "AIO_ALLDONE";
     break;
 default:
     errstr = "Call failed";
 }
 printf("Error value returned %s\n", errstr);
 
and when the program is run, the following error value is returned:
 	Error value returned AIO_CANCELED
 
In this case, the I/O operation was canceled. Consider the same program but instead of issuing a 64MB write, a small 512 byte I/O is issued:
 	Error value returned AIO_NOTCANCELED
 
In this case, the I/O was already in progress, so the kernel was unable to prevent it from completing.

As mentioned above, the Oracle DBWR process will likely issue multiple I/Os simultaneously and wait for them to complete at a later time. Multiple read() and write() system calls can be combined through the use of readv() and writev() to help cut down on system call overhead. For async I/O, the lio_listio() function achieves the same result:

 #include <aio.h>

 int lio_listio(int mode, struct aiocb *const list[], int nent,
                struct sigevent *sig);

The mode argument can be either LIO_WAIT, in which case the requesting process will block in the kernel until all I/O operations have completed, or LIO_NOWAIT, in which case the kernel returns control to the user as soon as the I/Os have been queued. The list argument is an array of nent pointers to aiocb structures. Note that for each aiocb structure, the aio_lio_opcode field must be set to either LIO_READ for a read operation, LIO_WRITE for a write operation, or LIO_NOP, in which case the entry will be ignored.

If the mode flag is LIO_NOWAIT, the sig argument specifies the signal that should be posted to the process once the I/O has completed.

The following example uses lio_listio() to issue two async writes to different parts of the file. Once the I/O has completed, the signal handler aiohdlr() will be invoked; this displays the time that it took for both writes to complete.

 #include <aio.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <time.h>
 #include <unistd.h>

 #define FILESZ (1024 * 1024 * 64)

 time_t time1, time2;

 /* invoked once both writes have completed */
 void
 aiohdlr(int signo)
 {
     time(&time2);
     printf("Time for write was %ld seconds\n", (long)(time2 - time1));
 }

 int main(void)
 {
     struct sigevent mysig;
     struct aiocb *laio[2];
     struct aiocb aio1, aio2;
     void *buf;
     int fd;

     buf = malloc(FILESZ);
     fd = open("/dev/vx/rdsk/fs1", O_WRONLY);

     /* first write: FILESZ bytes at offset 0 */
     memset(&aio1, 0, sizeof(aio1));
     aio1.aio_fildes = fd;
     aio1.aio_lio_opcode = LIO_WRITE;
     aio1.aio_buf = buf;
     aio1.aio_offset = 0;
     aio1.aio_nbytes = FILESZ;
     aio1.aio_reqprio = 0;
     laio[0] = &aio1;

     /* second write: FILESZ bytes immediately after the first */
     memset(&aio2, 0, sizeof(aio2));
     aio2.aio_fildes = fd;
     aio2.aio_lio_opcode = LIO_WRITE;
     aio2.aio_buf = buf;
     aio2.aio_offset = FILESZ;
     aio2.aio_nbytes = FILESZ;
     aio2.aio_reqprio = 0;
     laio[1] = &aio2;

     memset(&mysig, 0, sizeof(mysig));
     sigset(SIGUSR1, aiohdlr);
     mysig.sigev_signo = SIGUSR1;
     mysig.sigev_notify = SIGEV_SIGNAL;
     mysig.sigev_value.sival_ptr = (void *)laio;

     time(&time1);
     lio_listio(LIO_NOWAIT, laio, 2, &mysig);
     pause();    /* wait for the completion signal */
     return 0;
 }
 
The call to lio_listio() specifies that the program should not wait and that a signal should be posted to the process after all I/Os have completed. Although not described here, it is possible to use real-time signals through which information can be passed back to the signal handler to determine which async I/O has completed. This is particularly important when there are multiple simultaneous calls to lio_listio(). Bill Gallmeister's book Posix.4: Programming for the Real World [GALL95] describes how to use real-time signals.

When the program is run the following is observed:

 # listio
 Time for write was 12 seconds
 
which clearly shows the amount of time that this process could have been performing other work rather than waiting for the I/O to complete.

Memory Mapped Files

In addition to reading and writing files through the use of read() and write(), UNIX supports the ability to map a file into the process address space and read and write to the file through memory accesses. This allows unrelated processes to access files with either shared or private mappings. Mapped files are also used by the operating system for executable files.

The mmap() system call allows a process to establish a mapping to an already open file:

 #include <sys/mman.h>

 void *mmap(void *addr, size_t len, int prot, int flags,
            int fildes, off_t off);
 
The file is mapped from an offset of off bytes within the file for len bytes. Note that the offset must be on a page size boundary. Thus, if the page size of the system is 4KB, the offset must be 0, 4096, 8192 and so on. The size of the mapping does not need to be a multiple of the page size, although the kernel will round the request up to the nearest page size boundary. For example, if off is set to 0 and len is set to 2048, on systems with a 4KB page size, the mapping established will actually be for 4KB.
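
The rounding arithmetic can be sketched with a few small helpers. The function names are hypothetical, and the code assumes the page size is a power of two, which holds for the 4KB and 8KB sizes discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Round a mapping length up to the next multiple of the page size. */
size_t
round_up_to_page(size_t len, size_t pagesz)
{
    assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);  /* power of two */
    return (len + pagesz - 1) & ~(pagesz - 1);
}

/* Round a file offset down to the page boundary at or below it.
 * This is the value that would be passed to mmap() as off. */
size_t
page_align_down(size_t off, size_t pagesz)
{
    assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
    return off & ~(pagesz - 1);
}

/* Distance of a byte from the start of its page. After mapping from
 * page_align_down(off), the byte lives at this index in the mapping. */
size_t
offset_in_page(size_t off, size_t pagesz)
{
    assert(pagesz != 0 && (pagesz & (pagesz - 1)) == 0);
    return off & (pagesz - 1);
}
```

For example, to reach byte 5000 of a file on a 4KB-page system, an application would map from offset page_align_down(5000, 4096), which is 4096, and then read index offset_in_page(5000, 4096), which is 904, within the mapping.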

Figure 3.2 shows the relationship between the pages in the user's address space and how they relate to the file being mapped. The page size of the underlying hardware platform can be determined by making a call to sysconf() as follows:

 #include <stdio.h>
 #include <unistd.h>

 int main(void)
 {
     printf("PAGESIZE = %ld\n", sysconf(_SC_PAGESIZE));
     return 0;
 }
 
Typically the page size will be 4KB or 8KB. For example, as expected, when the program is run on an x86 processor, the following is reported:
 # ./sysconf
 PAGESIZE = 4096
 
while for Sparc 9 based hardware:
 # ./sysconf
 PAGESIZE = 8192
 
Although it is possible for the application to specify the address to which the file should be mapped, it is recommended that the addr field be set to 0 so that the system has the freedom to choose which address the mapping will start from. The operating system dynamic linker places parts of the executable program in various memory locations. The amount of memory used differs from one process to the next. Thus, an application should never rely on locating data at the same place in memory even within the same operating system and hardware architecture. The address at which the mapping is established is returned if the call to mmap() is successful; otherwise MAP_FAILED is returned and errno is set to indicate the error.

Note that after the file has been mapped it can be closed and still accessed through the mapping.

Before describing the other parameters, here is a very simple example showing the basics of mmap():

 #include <sys/types.h>
 #include <sys/stat.h>
 #include <sys/mman.h>
 #include <fcntl.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>

 #define MAPSZ 4096

 int main(void)
 {
     char *addr, c;
     int fd;

     fd = open("/etc/passwd", O_RDONLY);
     addr = (char *)mmap(NULL, MAPSZ,
                         PROT_READ, MAP_SHARED, fd, 0);
     close(fd);
     /* print the first line of the file through the mapping */
     for (;;) {
         c = *addr;
         putchar(c);
         addr++;
         if (c == '\n') {
             exit(0);
         }
     }
 }

The /etc/passwd file is opened and a call to mmap() is made to map the first MAPSZ bytes of the file. A file offset of 0 is passed. The PROT_READ and MAP_SHARED arguments describe the type of mapping and how it relates to other processes that map the same file. The prot argument (in this case PROT_READ) can be one of the following:
PROT_READ. The data can be read.
PROT_WRITE. The data can be written.
PROT_EXEC. The data can be executed.
PROT_NONE. The data cannot be accessed.

Note that the different access types can be combined. For example, to specify read and write access a combination of (PROT_READ|PROT_WRITE) may be specified. By specifying PROT_EXEC it is possible for application writers to produce their own dynamic library mechanisms. The PROT_NONE argument can be used for user level memory management by preventing access to certain parts of memory at certain times. Note that PROT_NONE cannot be used in conjunction with any other flags.
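
A slightly more defensive variant of a read-only mapping is sketched below. It checks the return value against MAP_FAILED, which is what POSIX specifies for a failed mmap(), and closes the descriptor as soon as the mapping exists. The helper name count_lines_mapped() is invented for this illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count newline characters in a file through a
 * read-only mapping. Returns the count, or -1 on error. */
long
count_lines_mapped(const char *path)
{
    struct stat st;
    char *addr;
    long lines = 0;
    off_t i;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* a zero-length mapping is rejected */
        close(fd);
        return 0;
    }

    addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (addr == MAP_FAILED)          /* failure is MAP_FAILED, not NULL */
        return -1;

    for (i = 0; i < st.st_size; i++)
        if (addr[i] == '\n')
            lines++;

    munmap(addr, (size_t)st.st_size);
    return lines;
}
```

Because only PROT_READ is requested, any attempt to store through addr would fault rather than modify the file.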

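The flags argument is built from the following values. Either MAP_SHARED or MAP_PRIVATE must be specified, and MAP_FIXED may be combined with either of them: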
MAP_SHARED. Any changes made through the mapping will be reflected back to the mapped file and are visible by other processes calling mmap() and specifying MAP_SHARED.
MAP_PRIVATE. Any changes made through the mapping are private to this process and are not reflected back to the file.
MAP_FIXED. The addr argument should be interpreted exactly. This argument will be typically used by dynamic linkers to ensure that program text and data are laid out in the same place in memory for each process. If MAP_FIXED is specified and the area specified in the mapping covers an already existing mapping, the initial mapping is first unmapped.

Note that in some versions of UNIX, the flags have been enhanced to include operations that are not covered by the Single UNIX Specification. For example, on the Solaris operating system, the MAP_NORESERVE flag indicates that swap space should not be reserved. This avoids unnecessary wastage of virtual memory and is especially useful when mappings are read-only. Note, however, that this flag is not portable to other versions of UNIX.
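
The difference between MAP_SHARED and MAP_PRIVATE can be observed with a small experiment, sketched below. The helper name and the "AAAA" test data are made up for the illustration, and the code assumes a system with a unified page cache (such as Linux), where a store through a shared mapping is immediately visible to read().

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path holding "AAAA", overwrite the first
 * byte with 'Z' through a mapping made with the given flags
 * (MAP_SHARED or MAP_PRIVATE), then return the first byte as the file
 * itself now holds it. Returns -1 on error. */
int
first_byte_after_store(const char *path, int flags)
{
    char *addr, c;
    int fd;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, "AAAA", 4) != 4) {
        close(fd);
        return -1;
    }

    addr = mmap(NULL, 4, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    addr[0] = 'Z';                   /* store through the mapping */
    munmap(addr, 4);

    /* read the byte back through the file, not through the mapping */
    if (pread(fd, &c, 1, 0) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (unsigned char)c;
}
```

With MAP_PRIVATE the file should still begin with 'A', because the store went to a private copy of the page; with MAP_SHARED it should begin with 'Z'.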

To give a more concrete example of the use of mmap(), an abbreviated implementation of the cp utility is given. This is how some versions of UNIX actually implement cp.

 1 #include <sys/types.h>
 2 #include <sys/stat.h>
 3 #include <sys/mman.h>
 4 #include <fcntl.h>
 5 #include <unistd.h>
 6 #include <stdio.h>
 7 #include <stdlib.h>
 8 #define MAPSZ 4096
 9 int main(int argc, char *argv[])
 10 {
 11     struct stat st;
 12     size_t iosz;
 13     off_t off = 0;
 14     void *addr;
 15     int ifd, ofd;
 16
 17     if (argc != 3) {
 18         printf("Usage: mycp srcfile destfile\n");
 19         exit(1);
 20     }
 21     if ((ifd = open(argv[1], O_RDONLY)) < 0)
 22         printf("Failed to open %s\n", argv[1]);
 23
 24     if ((ofd = open(argv[2],
 25             O_WRONLY|O_CREAT|O_TRUNC, 0777)) < 0) {
 26         printf("Failed to open %s\n", argv[2]);
 27     }
 28     fstat(ifd, &st);
 29     if (st.st_size < MAPSZ) {
 30         addr = mmap(NULL, st.st_size,
 31                     PROT_READ, MAP_SHARED, ifd, 0);
 32         printf("Mapping entire file\n");
 33         close(ifd);
 34         write (ofd, (char *)addr, st.st_size);
 35     } else {
 36         printf("Mapping file by MAPSZ chunks\n");
 37         while (off <= st.st_size) {
 38             addr = mmap(NULL, MAPSZ, PROT_READ,
 39                         MAP_SHARED, ifd, off);
 40             if (MAPSZ < (st.st_size - off)) {
 41                 iosz = MAPSZ;
 42             } else {
 43                 iosz = st.st_size - off;
 44             }
 45             write (ofd, (char *)addr, iosz);
 46             off += MAPSZ;
 47         }
 48     }
 49 }
 
The file to be copied is opened and the file to copy to is created on lines 21-27. The fstat() system call is invoked on line 28 to determine the size of the file to be copied. For files smaller than MAPSZ, the first call to mmap() maps the whole file (line 30). A single call to write() can then be issued to write the contents of the mapping to the output file.

If the file is MAPSZ bytes or larger, the program instead loops (lines 37-47), mapping sections of the file and writing them to the destination file.

Note that in the example here, MAP_PRIVATE could be used in place of MAP_SHARED since the file was only being read. Here is an example of the program running:

 $ cp mycp.c fileA
 $ mycp fileA fileB
 Mapping entire file
 $ diff fileA fileB
 $ cp mycp fileA
 $ mycp fileA fileB
 Mapping file by MAPSZ chunks
 $ diff fileA fileB
 
Note that if the file is to be mapped in chunks, we keep making repeated calls to mmap(). This is an extremely inefficient use of memory because each call to mmap() will establish a new mapping without first tearing down the old mapping. Eventually the process will either exceed its virtual memory quota or run out of address space if the file to be copied is very large. For example, here is a run of a modified version of the program that displays the addresses returned by mmap():
 $ dd if=/dev/zero of=20kfile bs=4096 count=
 5+0 records in
 5+0 records out
 $ mycp_profile 20kfile newfile
 Mapping file by MAPSZ chunks
 map addr = 0x40019000
 map addr = 0x4001a000
 map addr = 0x4001b000
 map addr = 0x4001c000
 map addr = 0x4001d000
 map addr = 0x4001e000
 
The different addresses show that each call to mmap() establishes a mapping at a new address. To alleviate this problem, the munmap() system call can be used to unmap a previously established mapping:

 #include <sys/mman.h>

 int munmap(void *addr, size_t len);

Thus, using the example above and adding the following line:

 munmap(addr, iosz);

after line 46, the mapping established will be unmapped, freeing up both the user's virtual address space and associated physical pages. Thus, running the program again and displaying the addresses returned by calling mmap() shows:
 $ mycp2 20kfile newfile
 Mapping file by MAPSZ chunks
 map addr = 0x40019000
 map addr = 0x40019000
 map addr = 0x40019000
 map addr = 0x40019000
 map addr = 0x40019000
 map addr = 0x40019000
 
The program determines whether to map the whole file based on the value of MAPSZ and the size of the file. One way to modify the program would be to attempt to map the whole file regardless of size and only switch to mapping in segments if the file is too large, causing the call to mmap() to fail.
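
That modification might look like the sketch below. It is not the implementation used by any particular cp; the function name copy_mapped(), the helper write_all(), and the CHUNKSZ value are assumptions chosen for the illustration. CHUNKSZ is picked as a multiple of common page sizes so that every chunk offset stays page-aligned.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNKSZ (64 * 4096)   /* assumed fallback chunk size */

/* Write n bytes, retrying after short writes. */
static int
write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, p, n);
        if (w < 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Copy src to dst: try one mapping of the whole file first and fall
 * back to chunked mappings only if that mmap() call fails. */
int
copy_mapped(const char *src, const char *dst)
{
    struct stat st;
    void *addr;
    off_t off = 0;
    int ifd, ofd, rc = 0;

    if ((ifd = open(src, O_RDONLY)) < 0)
        return -1;
    if ((ofd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
        close(ifd);
        return -1;
    }
    if (fstat(ifd, &st) < 0)
        rc = -1;

    if (rc == 0 && st.st_size > 0) {
        addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, ifd, 0);
        if (addr != MAP_FAILED) {
            rc = write_all(ofd, addr, (size_t)st.st_size);
            munmap(addr, (size_t)st.st_size);
        } else {
            while (rc == 0 && off < st.st_size) {
                size_t iosz = (st.st_size - off > CHUNKSZ)
                    ? (size_t)CHUNKSZ : (size_t)(st.st_size - off);

                addr = mmap(NULL, iosz, PROT_READ, MAP_PRIVATE, ifd, off);
                if (addr == MAP_FAILED) {
                    rc = -1;
                    break;
                }
                rc = write_all(ofd, addr, iosz);
                munmap(addr, iosz);  /* release before mapping the next chunk */
                off += (off_t)iosz;
            }
        }
    }
    close(ifd);
    close(ofd);
    return rc;
}
```

Each chunk is unmapped before the next one is mapped, so the address space used by the fallback path stays bounded by CHUNKSZ regardless of file size.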

After a mapping is established with a specific set of access protections, it may be desirable to change these protections over time. The mprotect() system call allows the protections to be changed:

 #include <sys/mman.h>

 int mprotect(void *addr, size_t len, int prot);
 
The prot argument can be one of PROT_READ, PROT_WRITE, PROT_EXEC, PROT_NONE, or a valid combination of the flags as described above. Note that the range of the mapping specified by a call to mprotect() does not have to cover the entire range of the mapping established by a previous call to mmap(). The kernel will perform some rounding to ensure that len is rounded up to the next multiple of the page size.
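
As a small sketch of mprotect() in use, the helper below fills a page and then makes it read-only. The function name is hypothetical, and /dev/zero is used only as a convenient source of a writable one-page mapping.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one page of /dev/zero read-write, store to
 * it, then revoke write access with mprotect().
 * Returns 0 on success, -1 on failure. */
int
make_page_readonly(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    char *addr;
    int fd, rc;

    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return -1;
    addr = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                MAP_PRIVATE, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return -1;

    addr[0] = 'x';                       /* allowed: the page is writable */

    rc = mprotect(addr, (size_t)pagesz, PROT_READ);

    /* A store to addr[0] at this point would raise SIGSEGV.
     * Reads continue to work and still see the earlier store. */
    if (rc == 0 && addr[0] != 'x')
        rc = -1;

    munmap(addr, (size_t)pagesz);
    return rc;
}
```

The same call with PROT_NONE would make the page inaccessible altogether, which is the basis of the user-level memory management techniques mentioned earlier.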

The other system call that is of importance with respect to memory mapped files is msync(), which allows modifications to the mapping to be flushed to the underlying file:

 #include <sys/mman.h>

 int msync(void *addr, size_t len, int flags);
 
Again, the range specified by the combination of addr and len does not need to cover the entire range of the mapping. The flags argument can be one of the following:
MS_ASYNC. Perform an asynchronous write of the data.
MS_SYNC. Perform a synchronous write of the data.
MS_INVALIDATE. Invalidate any cached data.

Thus, a call to mmap() followed by modification of the data followed by a call to msync() specifying the MS_SYNC flag is similar to a call to write() following a call to open() that specified the O_SYNC flag. Specifying the MS_ASYNC flag is loosely equivalent to writing to a file that was opened without the O_SYNC flag. However, calling msync() with the MS_ASYNC flag is likely to initiate the I/O, while writing to a file without specifying O_SYNC or O_DSYNC could result in data sitting in the system page or buffer cache for some time.
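
The sketch below flushes a single modified page rather than the whole mapping. The helper name poke_and_sync() is invented for the example. Because many implementations require the address passed to msync() to be page-aligned, the start of the range is rounded down to a page boundary.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path as filesz zero bytes, store value at
 * byte offset off through a shared mapping, synchronously flush only
 * the page holding that byte, and return the byte as pread() then sees
 * it. Returns -1 on error. */
int
poke_and_sync(const char *path, size_t filesz, size_t off, char value)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t page_start = off & ~(pagesz - 1);   /* page-aligned start */
    char *addr, c;
    int fd, rc;

    if (off >= filesz)
        return -1;
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)filesz) < 0) {
        close(fd);
        return -1;
    }

    addr = mmap(NULL, filesz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    addr[off] = value;

    /* MS_SYNC: do not return until the page has been written */
    rc = msync(addr + page_start, pagesz, MS_SYNC);
    munmap(addr, filesz);

    if (rc != 0 || pread(fd, &c, 1, (off_t)off) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return (unsigned char)c;
}
```

Replacing MS_SYNC with MS_ASYNC would let the call return as soon as the write has been scheduled.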

One unusual property of mapped files occurs when the pseudo device /dev/zero is mapped. As one would expect, this gives access to a contiguous set of zeroes covering any part of the mapping that is accessed. However, following a mapping of /dev/zero, if the process were to fork, the mapping would be visible to both parent and child. If MAP_PRIVATE was specified on the call to mmap(), parent and child will share the same physical pages of the mapping until a modification is made, at which time the kernel will copy the page so that the modification is private to the process that issued the write.

If MAP_SHARED is specified, both parent and children will share the same physical pages regardless of whether read or write operations are performed.
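
This behavior can be checked with the following sketch, in which a child process stores into a /dev/zero mapping and the parent then looks at the same byte. The helper name is hypothetical.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map one page of /dev/zero with the given flags
 * (MAP_SHARED or MAP_PRIVATE), fork, and let the child store 1 into the
 * first byte. Returns the value the parent sees afterwards (0 or 1),
 * or -1 on error. */
int
parent_sees_child_store(int flags)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    volatile char *addr;
    pid_t pid;
    int fd, seen;

    fd = open("/dev/zero", O_RDWR);
    if (fd < 0)
        return -1;
    addr = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE, flags, fd, 0);
    close(fd);
    if (addr == MAP_FAILED)
        return -1;

    pid = fork();
    if (pid < 0) {
        munmap((void *)addr, (size_t)pagesz);
        return -1;
    }
    if (pid == 0) {
        addr[0] = 1;                 /* child modifies the mapping */
        _exit(0);
    }

    if (waitpid(pid, NULL, 0) < 0) { /* wait until the child has finished */
        munmap((void *)addr, (size_t)pagesz);
        return -1;
    }
    seen = addr[0];
    munmap((void *)addr, (size_t)pagesz);
    return seen;
}
```

With MAP_SHARED the parent should read 1, since both processes use the same physical page. With MAP_PRIVATE the child's store triggers a copy of the page, so the parent should still read 0.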

64-Bit File Access (LFS)

32-bit operating systems have typically used a signed long integer as the offset to files. This leads to a maximum file size of 2^31 - 1 bytes (2GB - 1). The amount of work to convert existing applications to use a different size type for file offsets was considered too great, and thus the Large File Summit was formed, a group of OS and filesystem vendors who wanted to produce a specification that could allow access to large files. The specification would then be included as part of the Single UNIX Specification (UNIX 95 and onwards). The specification provided the following concepts:
The off_t data type would support one of two or more sizes as the OS and filesystem evolved to a full 64-bit solution.
An offset maximum which, as part of the interface, would give the maximum offset that the OS/filesystem would allow an application to use. The offset maximum is determined through a call to open() by specifying (or not) whether the application wishes to access large files.
When applications attempt to read parts of a file beyond their understanding of the offset maximum, the OS would return a new error code, namely EOVERFLOW.

In order to provide both an explicit means of accessing large files as well as a hidden and easily upgradable approach, there were two programmatic models. The first allowed the size of off_t to be determined during the compilation and linking process. This effectively sets the size of off_t and determines whether the standard system calls such as read() and write() will be used or whether the large file specific libraries will be used. Either way, the application continues to use read(), write(), and related system calls, and the mapping is done during the link time.

The second approach provided an explicit model whereby the size of off_t was chosen explicitly within the program. For example, on a 32-bit OS, the size of off_t would be 32 bits, and large files would need to be accessed through use of the off64_t data type. In addition, specific calls such as open64() and lseek64() would be required in order to access large files.
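
The first, compile-time model can be sketched as follows. The _FILE_OFFSET_BITS macro is not named in the text above; it is the feature-test macro commonly used for this purpose on Linux and Solaris, and it must be defined before any system header is included (or passed on the compiler command line). The helper names are hypothetical.

```c
/* Must appear before any system header is included. */
#define _FILE_OFFSET_BITS 64

#include <assert.h>
#include <limits.h>
#include <sys/types.h>

/* Number of bits in off_t in this compilation environment. */
int
off_t_bits(void)
{
    return (int)(sizeof(off_t) * CHAR_BIT);
}

/* Largest offset a signed offset type of the given width can hold:
 * 2^(bits-1) - 1, computed without overflowing long long. */
long long
max_offset(int bits)
{
    assert(bits >= 2 && bits <= 64);
    return ((1LL << (bits - 2)) - 1) * 2 + 1;
}
```

Here max_offset(32) gives 2147483647, the 2GB - 1 limit mentioned earlier. An application built this way keeps calling open(), lseek(), and friends; the headers arrange for the large file versions to be used.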

Today, the issue has largely gone away, with most operating systems supporting large files by default.

Sparse Files

Due to their somewhat rare usage, sparse files are often not well understood and a cause of confusion. For example, the VxFS filesystem up to version 3.5 allowed a maximum filesystem size of 1TB but a maximum file size of 2TB. How can a single file be larger than the filesystem in which it resides?

A sparse file is simply a file that contains one or more holes. This statement itself is probably the reason for the confusion. A hole is a gap within the file for which there are no allocated data blocks. For example, a file could contain a 1KB data block followed by a 1KB hole followed by another 1KB data block. The size of the file would be 3KB but there are only two blocks allocated. When reading over a hole, zeroes will be returned.

The following example shows how this works in practice. First of all, a 20MB filesystem is created and mounted:

 # mkfs -F vxfs /dev/vx/rdsk/rootdg/vol2 20m
 version 4 layout
 40960 sectors, 20480 blocks of size 1024, log size 1024 blocks
 unlimited inodes, largefiles not supported
 20480 data blocks, 19384 free data blocks
 1 allocation units of 32768 blocks, 32768 data blocks
 last allocation unit has 20480 data blocks
 # mount -F vxfs /dev/vx/dsk/rootdg/vol2 /mnt2
 
and the following program, which is used to create a new file, seeks to an offset of 64MB and then writes a single byte:
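 #include <sys/types.h>
 #include <fcntl.h>
 #include <unistd.h>

 #define IOSZ (1024 * 1024 * 64)

 int main(void)
 {
     int fd;

     fd = open("/mnt2/newfile", O_CREAT | O_WRONLY, 0666);
     lseek(fd, IOSZ, SEEK_SET);    /* move 64MB into the file; nothing is allocated */
     write(fd, "a", 1);            /* the only byte actually stored */
     return 0;
 }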
 
The following shows the result when the program is run:
 # ./lf
 # ls -l /mnt2
 total 
 drwxr-xr-x 2 root root 96 Jun 13 08:25 lost+found/
 -rw-r--r-- 1 root other 67108865 Jun 13 08:28 newfile
 # df -k | grep mnt2
 /dev/vx/dsk/rootdg/vol2 20480 1110 18167 6% /mnt2
 
And thus, the filesystem which is only 20MB in size contains a file which is 64MB. Note that, although the file size is 64MB, the actual space consumed is very low. The 6 percent usage, as displayed by running df, shows that the filesystem is mostly empty.
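
A program can observe the same difference between file size and space consumed by calling fstat(). The sketch below is an illustration with a hypothetical helper name; it assumes st_blocks is counted in 512-byte units, which is the usual convention but is not guaranteed by POSIX.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create path with a hole of the given length
 * followed by a single byte. On success returns 0 and reports the
 * file size and the space actually allocated to the file. */
int
make_sparse(const char *path, off_t hole, off_t *size, off_t *allocated)
{
    struct stat st;
    int fd;

    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (lseek(fd, hole, SEEK_SET) == (off_t)-1 ||
        write(fd, "a", 1) != 1 ||
        fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    close(fd);

    *size = st.st_size;                      /* logical size, hole included */
    *allocated = (off_t)st.st_blocks * 512;  /* assumed 512-byte units */
    return 0;
}
```

On a filesystem that supports sparse files, *allocated should come out far smaller than *size. On one that does not, the two values will be roughly equal.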

To help understand how sparse files can be useful, consider how storage is allocated to a file in a hypothetical filesystem. For this example, consider a filesystem that allocates storage to files in 1KB chunks and consider the interaction between the user and the filesystem as follows:

 User                Filesystem
 create()            Create a new file
 write(1k of a's)    Allocate a new 1k block for range 0 to 1023 bytes
 write(1k of b's)    Allocate a new 1k block for range 1024 to 2047 bytes
 close()             Close the file

In this example, following the close() call, the file has a size of 2048 bytes. The data written to the file is stored in two 1k blocks. Now, consider the example below:

 User                Filesystem
 create()            Create a new file
 lseek(to 1k)        No effect on the file; only the file pointer moves
 write(1k of b's)    Allocate a new 1k block for range 1024 to 2047 bytes
 close()             Close the file

The chain of events here also results in a file of size 2048 bytes. However, by seeking to a part of the file that doesn't exist and writing, the allocation occurs at the position in the file as specified by the file pointer. Thus, a single 1KB block is allocated to the file. The two different allocations are shown in Figure 3.3.

Note that although filesystems will differ in their individual implementations, each file will contain a block map mapping the blocks that are allocated to the file and at which offsets. Thus, in Figure 3.3, the hole is explicitly marked.

So what use are sparse files and what happens if the file is read? All UNIX standards dictate that if a file contains a hole and data is read from a portion of a file containing a hole, zeroes must be returned. Thus when reading the sparse file above, we will see the same result as for a file created as follows:

 User                Filesystem
 create()            Create a new file
 write(1k of 0's)    Allocate a new 1k block for range 0 to 1023 bytes
 write(1k of b's)    Allocate a new 1k block for range 1024 to 2047 bytes
 close()             Close the file
 
Not all filesystems implement sparse files and, as the examples above show, from a programmatic perspective, the holes in the file are not actually visible. The main benefit comes from the amount of storage that is saved. Thus, if an application wishes to create a file for which large parts of the file contain zeroes, this is a useful way to save on storage and potentially gain on performance by avoiding unnecessary I/Os.

The following program shows the example described above:

 #include <sys/types.h>
 #include <fcntl.h>
 #include <string.h>
 #include <unistd.h>

 int main(void)
 {
     char buf[1024];
     int fd;

     memset(buf, 'a', 1024);
     fd = open("newfile", O_RDWR|O_CREAT|O_TRUNC, 0777);
     lseek(fd, 1024, SEEK_SET);    /* skip the first 1KB, leaving a hole */
     write(fd, buf, 1024);
     return 0;
 }
 
When the program is run the contents are displayed as shown below. Note the zeroes for the first 1KB as expected.
 $ od -c newfile
 0000000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
 0002000 a a a a a a a a a a a a a a a 
 0004000
 
If a write were to occur within the first 1KB of the file, the filesystem would have to allocate a 1KB block even if the size of the write is less than 1KB. For example, by modifying the program as follows:
 memset(buf, 'b', 512);
 fd = open("newfile", O_RDWR);
 lseek(fd, 256, SEEK_SET);
 write(fd, buf, 512);
 
and then running it on the previously created file, the resulting contents are:
 $ od -c newfile
 0000000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
 0000400 b b b b b b b b b b b b b b b 
 0001400 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \
 0002000 a a a a a a a a a a a a a a a 
 0004000
 
Therefore in addition to allocating a new 1KB block, the filesystem must zero fill those parts of the block outside of the range of the write. The following example shows how this works on a VxFS filesystem. A new file is created. The program then seeks to byte offset 8192 and writes 1024 bytes.
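 #include <sys/types.h>
 #include <fcntl.h>
 #include <string.h>
 #include <unistd.h>

 int main(void)
 {
     int fd;
     char buf[1024];

     memset(buf, 'a', sizeof(buf));    /* give the buffer defined contents */
     fd = open("myfile", O_CREAT | O_WRONLY, 0666);
     lseek(fd, 8192, SEEK_SET);
     write(fd, buf, 1024);
     return 0;
 }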
 
In the output shown below, the program is run, the size of the new file is displayed, and the inode number of the file is obtained:
 # ./sparse
 # ls -l myfile
 -rw-r--r-- 1 root other 9216 Jun 13 08:37 myfile
 # ls -i myfile
 6 myfile
 
The VxFS fsdb command can show which blocks are assigned to the file. The inode corresponding to the file created is displayed:
 # umount /mnt2
 # fsdb -F vxfs /dev/vx/rdsk/rootdg/vol2
 # > 6i
 inode structure at 0x00000431.0200
 type IFREG mode 100644 nlink 1 uid 0 gid 1 size 9216
 atime 992447379 122128 (Wed Jun 13 08:49:39 2001)
 mtime 992447379 132127 (Wed Jun 13 08:49:39 2001)
 ctime 992447379 132127 (Wed Jun 13 08:49:39 2001)
 aflags 0 orgtype 1 eopflags 0 eopdata 
 fixextsize/fsindex 0 rdev/reserve/dotdot/matchino 
 blocks 1 gen 844791719 version 0 13 iattrino 
 de: 0 1096 0 0 0 0 0 0 0 
 des: 8 1 0 0 0 0 0 0 0 
 ie: 0 
 ies: 
 
The de field refers to a direct extent (filesystem block) and the des field is the extent size. For this file the first extent starts at block 0 and is 8 blocks (8KB) in size. VxFS uses block 0 to represent a hole (note that block 0 is never actually used). The next extent starts at block 1096 and is 1KB in length. Thus, although the file is 9KB in size, it has only one 1KB block allocated to it.

Summary

This chapter provided an introduction to file I/O based system calls. It is important to grasp these concepts before trying to understand how filesystems are implemented. By understanding what the user expects, it is easier to see how certain features are implemented and what the kernel and individual filesystems are trying to achieve.

Whenever programming on UNIX, it is always a good idea to follow appropriate standards to allow programs to be portable across multiple versions of UNIX. The commercial versions of UNIX typically support the Single UNIX Specification standard although this is not fully adopted in Linux and BSD. At the very least, all versions of UNIX will support the POSIX.1 standard.


 
 

CHAPTER 4

The Standard I/O Library

Many users require functionality above and beyond what is provided by the basic file access system calls. The standard I/O library, which is part of the ANSI C standard, provides this extra level of functionality, avoiding the need for duplication in many applications.

There are many books that describe the calls provided by the standard I/O library (stdio). This chapter offers a different approach by describing the implementation of the Linux standard I/O library showing the main structures, how they support the functions available, and how the library calls map onto the system call layer of UNIX.

The needs of the application will dictate whether the standard I/O library will be used as opposed to basic file-based system calls. If extra functionality is required and performance is not paramount, the standard I/O library, with its rich set of functions, will typically meet the needs of most programmers. If performance is key and more control is required over the execution of I/O, understanding how the filesystem performs I/O and bypassing the standard I/O library is typically a better choice.

Rather than describing the myriad of stdio functions available, which are well documented elsewhere, this chapter provides an overview of how the standard I/O library is implemented. For further details on the interfaces available, see Richard Stevens' book Advanced Programming in the UNIX Environment [STEV92] or consult the Single UNIX Specification.

The FILE Structure

Where system calls such as open() and dup() return a file descriptor through which the file can be accessed, the stdio library operates on a FILE structure, or file stream as it is often called. This is basically a character buffer together with enough information to record the current read and write file pointers and some other ancillary information. On Linux, the _IO_FILE structure from which the FILE structure is defined is shown below. Note that not all of the structure is shown here.
 struct _IO_FILE 
 {
 	char *_IO_read_ptr; /* Current read pointer */ 
 	char *_IO_read_end; /* End of get area. */ 
 	char *_IO_read_base; /* Start of putback and get area. */ 
 	char *_IO_write_base; /* Start of put area. */ 
 	char *_IO_write_ptr; /* Current put pointer. */ 
 	char *_IO_write_end; /* End of put area. */ 
 	char *_IO_buf_base; /* Start of reserve area. */ 
 	char *_IO_buf_end; /* End of reserve area. */ 
 	int _fileno; 
 	int _blksize; 
 }; 
 
 typedef struct _IO_FILE FILE;
 
Each of the structure fields will be analyzed in more detail throughout the chapter. However, first consider a call to the open() and read() system calls:

 fd = open("/etc/passwd", O_RDONLY);
 read(fd, buf, 1024);

When accessing a file through the stdio library routines, a FILE structure will be allocated and associated with the file descriptor fd, and all I/O will operate through a single buffer. For the _IO_FILE structure shown above, _fileno is used to store the file descriptor that is used on subsequent calls to read() or write(), and _IO_buf_base represents the buffer through which the data will pass.
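
The association between a stream and its descriptor can be seen without touching the private _fileno field by using the POSIX fileno() function. The sketch below, with a hypothetical helper name, reads the first byte of a file twice: once with a raw pread() on the stream's descriptor and once with getc() on the stream.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: return the first byte of path if the raw
 * descriptor and the stdio stream agree on it, or -1 otherwise. */
int
peek_first_byte(const char *path)
{
    FILE *fp;
    unsigned char raw;
    int fd, c;

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    fd = fileno(fp);                    /* the descriptor behind the stream */
    if (pread(fd, &raw, 1, 0) != 1) {   /* pread() leaves the file offset alone */
        fclose(fp);
        return -1;
    }

    c = getc(fp);                       /* first byte as stdio sees it */
    fclose(fp);                         /* also closes the descriptor */
    return (c == raw) ? c : -1;
}
```

Mixing raw descriptor I/O with stream I/O on the same file is generally best avoided, because the stream's buffer and the descriptor's file offset can drift apart. pread() is used here precisely because it does not move the offset.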

Standard Input, Output, and Error

The standard input, output, and error for a process can be referenced by the file descriptors STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO. To use the stdio library routines on any of these files, their corresponding file streams stdin, stdout, and stderr can be used. Here are the definitions of all three:

 extern FILE *stdin;
 extern FILE *stdout;
 extern FILE *stderr;
 
All three file streams can be accessed without opening them in the same way that the corresponding file descriptor values can be accessed without an explicit call to open().

There are some standard I/O library routines that operate on the standard input and output streams explicitly. For example, a call to printf() uses stdout by default whereas a call to fprintf() requires the caller to specify a file stream. Similarly, a call to getchar() operates on stdin while a call to getc() requires the file stream to be passed. The declaration of getchar() could simply be:

 #define getchar() getc(stdin)
 

Opening and Closing a Stream

The fopen() and fclose() library routines can be called to open and close a file stream:
 #include <stdio.h>

 FILE *fopen(const char *filename, const char *mode);
 int fclose(FILE *stream);

The mode argument points to a string that starts with one of the following sequences. Note that these sequences are part of the ANSI C standard.
r, rb. Open the file for reading.
w, wb. Truncate the file to zero length or, if the file does not exist, create a new file and open it for writing.
a, ab. Append to the file. If the file does not exist, it is first created.
r+, rb+, r+b. Open the file for update (reading and writing).
w+, wb+, w+b. Truncate the file to zero length or, if the file does not exist, create a new file and open it for update (reading and writing).
a+, ab+, a+b. Append to the file. If the file does not exist it is created and opened for update (reading and writing). Writing will start at the end of file.

Internally, the standard I/O library will map these flags onto the corresponding flags to be passed to the open() system call. For example, r will map to O_RDONLY, r+ will map to O_RDWR and so on. The process followed when opening a stream is shown in Figure 4.1.
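
One way such a mapping could be written is sketched below. This is not the code used by any particular C library, and the function name mode_to_open_flags() is hypothetical. It simply follows the table of mode strings above.

```c
#include <fcntl.h>
#include <string.h>

/* Hypothetical helper: translate an fopen() mode string into open()
 * flags. Returns -1 if the mode does not start with r, w, or a. */
int
mode_to_open_flags(const char *mode)
{
    int flags, plus;

    if (mode == NULL)
        return -1;

    /* a '+' anywhere in the string means "update": read and write */
    plus = strchr(mode, '+') != NULL;

    switch (mode[0]) {
    case 'r':
        flags = plus ? O_RDWR : O_RDONLY;
        break;
    case 'w':
        flags = (plus ? O_RDWR : O_WRONLY) | O_CREAT | O_TRUNC;
        break;
    case 'a':
        flags = (plus ? O_RDWR : O_WRONLY) | O_CREAT | O_APPEND;
        break;
    default:
        return -1;
    }

    /* 'b' is accepted but ignored: UNIX makes no distinction between
     * text and binary files. */
    return flags;
}
```

For example, mode_to_open_flags("w+") yields O_RDWR|O_CREAT|O_TRUNC, which matches the description of w+ as truncate-or-create and open for update.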

The following example shows the effects of some of the library routines on the FILE structure:

 #include <stdio.h>

 int main(void)
 {
     FILE *fp1, *fp2;
     char c;

     fp1 = fopen("/etc/passwd", "r");
     fp2 = fopen("/etc/mtab", "r");
     printf("Address of fp1 = %p\n", (void *)fp1);
     printf(" fp1->_fileno = 0x%x\n", fp1->_fileno);
     printf("Address of fp2 = %p\n", (void *)fp2);
     printf(" fp2->_fileno = 0x%x\n\n", fp2->_fileno);

     /* the first read on each stream causes its buffer to be allocated */
     c = getc(fp1);
     c = getc(fp2);
     printf(" fp1->_IO_buf_base = %p\n",
            (void *)fp1->_IO_buf_base);
     printf(" fp1->_IO_buf_end = %p\n",
            (void *)fp1->_IO_buf_end);
     printf(" fp2->_IO_buf_base = %p\n",
            (void *)fp2->_IO_buf_base);
     printf(" fp2->_IO_buf_end = %p\n",
            (void *)fp2->_IO_buf_end);
     return 0;
 }
 
Note that, even following a call to fopen(), the library will not allocate space for the I/O buffer until the user actually requests data to be read or written. Thus, the value of _IO_buf_base will initially be NULL. The calls to getc() in the above example force the library to allocate a buffer for each stream and to read data from the file into the newly allocated buffer.
 $ fpopen
 Address of fp1 = 0x8049860
 fp1->_fileno = 0x3
 Address of fp2 = 0x80499d0
 fp2->_fileno = 0x4
 
 fp1->_IO_buf_base = 0x40019000
 fp1->_IO_buf_end = 0x4001a000
 fp2->_IO_buf_base = 0x4001a000
 fp2->_IO_buf_end = 0x4001b000
 
Note that one can see the corresponding system calls that the library will make by running strace, truss, etc.
 $ strace fpopen 2>&1 | grep open
 open("/etc/passwd", O_RDONLY) = 
 open("/etc/mtab", O_RDONLY) = 
 $ strace fpopen 2>&1 | grep read
 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 827
 read(4, "/dev/hda6 / ext2 rw 0 0 none /pr"..., 4096) = 157
 
Note that despite the program's request to read only a single character from each file stream, the stdio library attempted to read 4KB from each file. Any subsequent calls to getc() do not require another call to read() until all characters in the buffer have been read.

There are two additional calls that can be invoked to open a file stream, namely fdopen() and freopen():

 #include <stdio.h>
 
 FILE *fdopen(int fildes, const char *mode);
 FILE *freopen(const char *filename,
               const char *mode, FILE *stream);
 
The fdopen() function associates a new file stream with an already open file descriptor. It is typically used in conjunction with functions that return only a file descriptor, such as dup(), pipe(), and fcntl(). The mode argument must be compatible with the mode in which the descriptor was opened.

The freopen() function opens the file whose name is pointed to by filename and associates the stream pointed to by stream with it. The original stream (if it exists) is first closed. This is typically used to associate a file with one of the predefined streams, standard input, output, or error. For example, if the caller wishes to use functions such as printf() that operate on standard output by default, but also wants to use a different file stream for standard output, this function achieves the desired effect.
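
As a minimal sketch of fdopen() in use, the following helper layers a stream on each end of a pipe and sends one line through it. The function name pipe_roundtrip() and its interface are invented for this illustration, and the message is assumed to be small enough to fit in the pipe.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustrative helper (the name is made up for this sketch): send one
 * line through a pipe using stdio streams layered on the two pipe
 * descriptors. Returns 0 on success and -1 on failure.
 */
int pipe_roundtrip(const char *msg, char *out, int outlen)
{
    int fds[2];
    FILE *rfp, *wfp;
    int ok;

    if (pipe(fds) == -1)
        return -1;

    rfp = fdopen(fds[0], "r");      /* stream on the read end */
    wfp = fdopen(fds[1], "w");      /* stream on the write end */
    if (rfp == NULL || wfp == NULL) {
        /* release whichever resources were obtained */
        if (rfp) fclose(rfp); else close(fds[0]);
        if (wfp) fclose(wfp); else close(fds[1]);
        return -1;
    }

    fprintf(wfp, "%s\n", msg);
    fclose(wfp);                    /* flushes the buffer, closes fds[1] */

    ok = (fgets(out, outlen, rfp) != NULL);
    fclose(rfp);                    /* also closes fds[0] */
    if (!ok)
        return -1;

    out[strcspn(out, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

Note that closing a stream created by fdopen() also closes the underlying file descriptor, so the descriptors must not be closed a second time.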

Standard I/O Library Buffering

The stdio library buffers data with the goal of minimizing the number of calls to the read() and write() system calls. There are three different types of buffering used:
Fully (block) buffered. As characters are written to the stream, they are buffered up to the point where the buffer is full. At this stage, the data is written to the file referenced by the stream. Similarly, reads will result in a whole buffer of data being read if possible.
Line buffered. As characters are written to a stream, they are buffered up until the point where a newline character is written. At this point the line of data including the newline character is written to the file referenced by the stream. Similarly for reading, characters are read up to the point where a newline character is found.
Unbuffered. When an output stream is unbuffered, any data that is written to the stream is immediately written to the file to which the stream is associated.

The ANSI C standard requires that standard error not be fully buffered, and that standard input and output be fully buffered only when they can be determined not to refer to an interactive device. Typically, standard input and output are line buffered for terminal devices and fully buffered otherwise, while standard error is unbuffered.

The setbuf() and setvbuf() functions can be used to change the buffering characteristics of a stream as shown:

 #include <stdio.h>
 
 void setbuf(FILE *stream, char *buf);
 int setvbuf(FILE *stream, char *buf, int type, size_t size);
 
The setbuf() function must be called after the stream is opened but before any I/O to the stream is initiated. The buffer specified by the buf argument is used in place of the buffer that the stdio library would use. This allows the caller to optimize the number of calls to read() and write() based on the needs of the application.

The setvbuf() function alters the buffering characteristics of the stream. Some implementations accept a call at any stage, but the C standard only guarantees its behavior when it is called after the stream is opened and before any other operation on that stream. The type argument can be one of _IONBF (unbuffered), _IOLBF (line buffered), or _IOFBF (fully buffered). The buffer specified by the buf argument must be at least size bytes. Prior to the next I/O, this buffer will replace the buffer currently in use for the stream if one has already been allocated. If buf is NULL, only the buffering mode will be changed.

Whether full or line buffering is used, the fflush() function can be used to force all of the buffered data to the file referenced by the stream as shown:

 #include <stdio.h>
 
 int fflush(FILE *stream);
 
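Note that all output streams can be flushed by setting stream to NULL. One further point worthy of mention concerns termination of a process: any streams that are still open are flushed and closed when the process terminates normally, that is, by calling exit() or returning from main(). They are not flushed if the process calls _exit() or is killed by a signal.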
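
The effect of the three buffering modes and of fflush() can be observed with a pipe whose read end is non-blocking. The sketch below is an illustration only: the helper bytes_visible() is a made-up name, the text passed in is assumed to be shorter than 256 bytes, and the exact behavior for line buffering is what common implementations such as glibc do rather than something the C standard spells out byte for byte.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper for this sketch: write `text` to a pipe through a
 * stream that uses buffering mode `mode` (_IOFBF, _IOLBF or _IONBF).
 * *before receives the number of bytes that reached the pipe without an
 * explicit flush, *after the number that arrived because of fflush().
 * Returns 0 on success and -1 on failure.
 */
int bytes_visible(int mode, const char *text, long *before, long *after)
{
    char buf[BUFSIZ];
    char sink[256];
    int fds[2];
    FILE *wfp;
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;

    /* Non-blocking read end: read() fails instead of waiting for data. */
    fcntl(fds[0], F_SETFL, O_NONBLOCK);

    wfp = fdopen(fds[1], "w");
    if (wfp == NULL) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    setvbuf(wfp, buf, mode, sizeof buf);   /* before any I/O on wfp */

    fputs(text, wfp);
    n = read(fds[0], sink, sizeof sink);
    *before = (n > 0) ? (long)n : 0;

    fflush(wfp);                           /* push out whatever is left */
    n = read(fds[0], sink, sizeof sink);
    *after = (n > 0) ? (long)n : 0;

    fclose(wfp);                           /* stream is closed before buf goes away */
    close(fds[0]);
    return 0;
}
```

Writing the seven bytes "one\ntwo" shows the difference: a fully buffered stream delivers nothing until fflush() is called, a line buffered stream delivers the first line immediately and the remainder on fflush(), and an unbuffered stream delivers everything at once.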

Reading and Writing to/from a Stream

There are numerous stdio functions for reading and writing. This section describes some of the functions available and shows a different implementation of the cp program using various buffering options. The program shown below demonstrates the effects on the FILE structure by reading a single character using the getc() function:
 1 #include <stdio.h>
 2
 3 int main(void)
 4 {
 5     FILE *fp;
 6     int c;
 7
 8     fp = fopen("/etc/passwd", "r");
 9     printf("address of fp = 0x%lx\n", (unsigned long)fp);
 10     printf(" fp->_fileno = 0x%x\n", fp->_fileno);
 11     printf(" fp->_IO_buf_base = 0x%lx\n", (unsigned long)fp->_IO_buf_base);
 12     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
 13
 14     c = getc(fp);    /* first read: allocates and fills the buffer */
 15     printf(" fp->_IO_buf_base = 0x%lx (size = %ld)\n",
 16            (unsigned long)fp->_IO_buf_base,
 17            (long)(fp->_IO_buf_end - fp->_IO_buf_base));
 18     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
 19     c = getc(fp);    /* served from the buffer, no system call */
 20     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
 21 }
 
As the output below shows, the buffer is not allocated until the first I/O is initiated. The default size of the buffer allocated is 4KB. With successive calls to getc(), the read pointer is incremented to reference the next byte to read within the buffer. Figure 4.2 shows the steps that the stdio library goes through to read the data.
 $ fpinfo
 Address of fp = 0x8049818
 fp->_fileno = 0x3
 fp->_IO_buf_base = 0x0
 fp->_IO_read_ptr = 0x0
 fp->_IO_buf_base = 0x40019000 (size = 4096)
 fp->_IO_read_ptr = 0x40019001
 fp->_IO_read_ptr = 0x40019002
 
By running strace on Linux, it is possible to see how the library reads the data following the first call to getc(). Note that only those lines that reference the /etc/passwd file are displayed here:
 $ strace fpinfo
 ..
 open("/etc/passwd", O_RDONLY) = 
 ..
 fstat(3, {st_mode=S_IFREG|0644, st_size=788, ...}) = 
 ..
 read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 788
 
The call to fopen() results in a call to open(), and the file descriptor returned is stored in fp->_fileno as shown above. Note that although the program only asked for a single character (line 14), the standard I/O library issued a 4KB read to fill up the buffer. The next call to getc() did not require any further data to be read from the file. Note that when the end of the file is reached, a subsequent call to getc() will return EOF.
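
Because getc() signals the end of the file by returning EOF, its result should be stored in an int rather than a char. Otherwise a data byte with the value 0xFF may be mistaken for EOF, or EOF may never be detected at all. The following sketch, in which count_bytes() is a name chosen for this example, shows the usual loop:

```c
#include <stdio.h>

/*
 * Illustrative helper: count the bytes remaining in a stream by calling
 * getc() until it returns EOF. Returns -1 if the loop ended because of
 * a read error rather than the end of the file.
 */
long count_bytes(FILE *fp)
{
    long n = 0;
    int c;              /* int, not char, so that EOF is distinguishable */

    while ((c = getc(fp)) != EOF)
        n++;

    return ferror(fp) ? -1 : n;
}
```

After the loop, feof() and ferror() distinguish a genuine end of file from a failed read, since getc() returns EOF in both cases.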

The following example provides a simple cp program showing the effects of using fully buffered, line buffered, and unbuffered I/O. The buffering option is passed as an argument. The file to copy from and the file to copy to are hard coded into the program for this example.

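 #include <stdio.h>
 #include <string.h>
 #include <time.h>

 int main(int argc, char **argv)
 {
     time_t time1, time2;
     FILE *ifp, *ofp;
     int mode;
     int c;                      /* int so that EOF can be detected */
     char ibuf[16384], obuf[16384];

     if (argc != 2) {
         fprintf(stderr, "usage: %s _IONBF|_IOLBF|_IOFBF\n", argv[0]);
         return 1;
     }
     if (strcmp(argv[1], "_IONBF") == 0) {
         mode = _IONBF;
     } else if (strcmp(argv[1], "_IOLBF") == 0) {
         mode = _IOLBF;
     } else {
         mode = _IOFBF;
     }

     ifp = fopen("infile", "r");
     ofp = fopen("outfile", "w");
     if (ifp == NULL || ofp == NULL) {
         perror("fopen");
         return 1;
     }

     /* set the buffering mode before any I/O on the streams */
     setvbuf(ifp, ibuf, mode, sizeof ibuf);
     setvbuf(ofp, obuf, mode, sizeof obuf);

     time(&time1);
     while ((c = fgetc(ifp)) != EOF) {
         fputc(c, ofp);
     }
     time(&time2);
     fprintf(stderr, "Time for %s was %ld seconds\n", argv[1],
             (long)(time2 - time1));

     /* close the streams while ibuf and obuf are still in scope */
     fclose(ifp);
     fclose(ofp);
     return 0;
 }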
 
The input file has 68,000 lines of 80 characters each, plus a newline per line, for a total of 5,508,000 bytes. When the program is run with the different buffering options, the following results are observed:
 $ ls -l infile
 -rw-r--r-- 1 spate fcf 5508000 Jun 29 15:38 infile
 $ wc -l infile
 68000 infile
 $ ./fpcp _IONBF
 Time for _IONBF was 35 seconds
 $ ./fpcp _IOLBF
 Time for _IOLBF was 3 seconds
 $ ./fpcp _IOFBF
 Time for _IOFBF was 2 seconds
 
The reason for such a huge difference in performance can be seen by the number of system calls that each option results in. For unbuffered I/O, each call to fgetc() or fputc() produces a system call to read() or write(). Because the file is copied one byte at a time, that amounts to 5,508,000 reads and 5,508,000 writes! The system call pattern seen for unbuffered is as follows:
 ..
 open("infile", O_RDONLY) = 
 open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 
 time([994093607]) = 994093607
 read(3, "0", 1) = 
 write(4, "0", 1) = 
 read(3, "1", 1) = 
 write(4, "1", 1) = 
 ..
 
For line buffered, the number of system calls is reduced dramatically as the system call pattern below shows. Note that data is still read in buffer-sized chunks.
 ..
 open("infile", O_RDONLY) = 
 open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 
 time([994093688]) = 994093688
 read(3, "01234567890123456789012345678901"..., 16384) = 16384
 write(4, "01234567890123456789012345678901"..., 81) = 81
 write(4, "01234567890123456789012345678901"..., 81) = 81
 write(4, "01234567890123456789012345678901"..., 81) = 81
 ..
 
For the fully buffered case, all data is read and written in buffer size (16384 bytes) chunks, reducing the number of system calls further as the following output shows:
 open("infile", O_RDONLY) = 
 open("outfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 
 read(3, "67890123456789012345678901234567"..., 4096) = 4096
 write(4, "01234567890123456789012345678901"..., 4096) = 4096
 read(3, "12345678901234567890123456789012"..., 4096) = 4096
 write(4, "67890123456789012345678901234567"..., 4096) = 4096
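
The number of system calls in each mode can be estimated from the file size, the line length, and the buffer size. The following is a rough sketch under stated assumptions: the names io_estimate, transfers() and estimate() are invented here, every line is assumed to have the same length and to fit in the buffer, the library is assumed to transfer whole buffers as in the line buffered trace above, and the final read() that returns 0 at the end of the file is not counted.

```c
#include <stdio.h>

/* Result of the estimate: number of read() and write() calls. */
struct io_estimate {
    long reads;
    long writes;
};

/* Transfers needed to move `total` bytes in chunks of `chunk` bytes,
 * rounded up to a whole number of transfers. */
long transfers(long total, long chunk)
{
    return (total + chunk - 1) / chunk;
}

/*
 * Hypothetical helper: estimate the system calls made by the copy loop
 * for a file of `size` bytes made up of lines of `line_len` bytes
 * (newline included), using stdio buffers of `bufsize` bytes.
 */
struct io_estimate estimate(int mode, long size, long line_len, long bufsize)
{
    struct io_estimate e;

    if (mode == _IONBF) {
        /* one system call per byte in each direction */
        e.reads = size;
        e.writes = size;
    } else if (mode == _IOLBF) {
        /* reads fill the buffer, writes go out one line at a time */
        e.reads = transfers(size, bufsize);
        e.writes = transfers(size, line_len);
    } else {
        /* both directions move a full buffer at a time */
        e.reads = transfers(size, bufsize);
        e.writes = transfers(size, bufsize);
    }
    return e;
}
```

For the 5,508,000-byte input file with 81-byte lines and 16384-byte buffers, the estimate is 5,508,000 reads and writes for _IONBF, 337 reads and 68,000 writes for _IOLBF, and 337 reads and 337 writes for _IOFBF.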
 

Seeking through the Stream

Just as the lseek() system call can be used to set the file pointer in preparation for a subsequent read or write, the fseek() library function can be called to set the file pointer for the stream such that the next read or write will start from that offset.
 #include <stdio.h>
 
 int fseek(FILE *stream, long int offset, int whence);
 
The offset and whence arguments are identical to those supported by the lseek() system call. The following example shows the effect of calling fseek() on the file stream:
 #include <stdio.h>

 int main(void)
 {
     FILE *fp;
     int c;

     fp = fopen("infile", "r");
     if (fp == NULL) {
         perror("infile");
         return 1;
     }
     printf("address of fp = 0x%lx\n", (unsigned long)fp);
     printf(" fp->_IO_buf_base = 0x%lx\n", (unsigned long)fp->_IO_buf_base);
     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);

     c = getc(fp);                  /* allocates and fills the buffer */
     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
     fseek(fp, 8192, SEEK_SET);     /* resets the read pointer */
     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
     c = getc(fp);                  /* refills the buffer from offset 8192 */
     printf(" fp->_IO_read_ptr = 0x%lx\n", (unsigned long)fp->_IO_read_ptr);
     return 0;
 }
 
By calling getc(), a 4KB read is used to fill up the buffer pointed to by _IO_buf_base. Because only a single character is returned by getc(), the read pointer is only advanced by one. The call to fseek() modifies the read pointer as shown below:
 
 $ fpseek
 Address of fp = 0x80497e0
 fp->_IO_buf_base = 0x0
 fp->_IO_read_ptr = 0x0
 fp->_IO_read_ptr = 0x40019001
 fp->_IO_read_ptr = 0x40019000
 fp->_IO_read_ptr = 0x40019001
 
Note that the seek invalidates the buffered data, so the second call to getc() must read from the file again. Here are the relevant system calls:
 open("infile", O_RDONLY) = 
 fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 
 read(3, "01234567890123456789012345678901"..., 4096) = 4096
 write(1, ...) # display _IO_read_ptr
 _llseek(3, 8192, [8192], SEEK_SET) = 
 write(1, ...) # display _IO_read_ptr
 read(3, "12345678901234567890123456789012"..., 4096) = 4096
 write(1, ...) # display _IO_read_ptr
 
The first call to getc() results in the call to read(). Seeking through the stream results in a call to lseek(), which also resets the read pointer. The second call to getc() then involves another call to read data from the file.

There are four other functions available that relate to the file position within the stream, namely:

 #include <stdio.h>
 
 long ftell(FILE *stream);
 void rewind(FILE *stream);
 int fgetpos(FILE *stream, fpos_t *pos);
 int fsetpos(FILE *stream, const fpos_t *pos);
 
The ftell() function returns the current file position. In the preceding example following the call to fseek(), a call to ftell() would return 8192. The rewind() function is equivalent to the following call, except that it also clears the error indicator for the stream:
 	fseek(stream, 0L, SEEK_SET);
 
The fgetpos() and fsetpos() functions are equivalent to ftell() and fseek() (with SEEK_SET passed), but store the current file pointer in the argument referenced by pos.
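
The positioning functions can be tried out on a scratch file. The following is a minimal sketch: position_demo() is a name made up for this example, and the scratch file is created with tmpfile() so that nothing is left behind on disk.

```c
#include <stdio.h>

/*
 * Illustrative helper: exercise fseek(), ftell(), fgetpos(), fsetpos()
 * and rewind() on a temporary file holding the digits 0 to 9.
 * Returns 0 if every call behaved as described in the text, otherwise
 * the number of the step that failed.
 */
int position_demo(void)
{
    FILE *fp = tmpfile();
    fpos_t mark;
    int rc = 0;

    if (fp == NULL)
        return 1;

    fputs("0123456789", fp);

    if (fseek(fp, 4L, SEEK_SET) != 0 || ftell(fp) != 4L) {
        rc = 2;                         /* seek to offset 4 failed */
    } else if (fgetpos(fp, &mark) != 0) {
        rc = 3;                         /* could not remember offset 4 */
    } else if (getc(fp) != '4' || ftell(fp) != 5L) {
        rc = 4;                         /* reading advances the position */
    } else {
        rewind(fp);                     /* back to the start of the file */
        if (ftell(fp) != 0L || getc(fp) != '0')
            rc = 5;
        else if (fsetpos(fp, &mark) != 0 || getc(fp) != '4')
            rc = 6;                     /* return to the remembered offset */
    }

    fclose(fp);
    return rc;
}
```

The fpos_t value is opaque and should only be passed back to fsetpos() on the same stream, whereas the long returned by ftell() can be inspected and used in arithmetic.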

Summary

There are numerous functions provided by the standard I/O library that often reduce the work of an application writer. By aiming to minimize the number of system calls, performance of some applications may be considerably improved. Buffering offers a great deal of flexibility to the application programmer by allowing finer control over how I/O is actually performed.

This chapter highlighted how the standard I/O library is implemented but stopped short of describing all of the functions that are available. Richard Stevens' book Advanced Programming in the UNIX Environment [STEV92] provides more details from a programming perspective. Herbert Schildt's book The Annotated ANSI C Standard [SCHI93] provides detailed information on the stdio library as supported by the ANSI C standard.
