Let us look at the two interfaces: that between kernel and user space, and that between application code and system code. We have three layers, with libc between the kernel and the application code.

3.1 Application programs and C library

The programmer who writes the canonical program

 #include <stdio.h>
 int main() {
         printf("Hello world!\n");
         return 0;
 }

programs for the libc interface. This C program calls the routine printf() that is part of the C library. A large part of the API (Application Program(ming) Interface) of the C library in a Unix environment is described by POSIX. The latest version of this standard is POSIX 1003.1-2001.

From the viewpoint of the application programmer (and from the viewpoint of the POSIX standard) there is no distinction between library routines and system calls. Kernel and C library together provide the services described.

Many things are handled by the C library itself - those are the things the user could have programmed himself, but need not since the author of the library did this job already. Maybe the presence of the library also saves some memory: many utilities can share common library code.

But for the basic things, starting programs, allocating memory, file I/O etc., the C library invokes the kernel.

3.2 Kernel and user space

The kernel provides certain services, and user space, that is, everything outside the kernel, both libraries and application programs, uses these. Programs in user space contain system calls that ask the kernel to do something, and the kernel does so, or returns an error code.

Application programs do not usually contain direct system calls. Instead, they use library calls and the library uses system calls. But an application program can construct a system call "by hand". For example, in order to use the system call _llseek (to seek in a file larger than 4 GB when lseek does not support that), one can write

 #include <linux/unistd.h>
 
 _syscall5(int, _llseek, unsigned int, fd,
           unsigned long, offset_high, unsigned long, offset_low,
           long long *, result, unsigned int, origin)
 
 long long
 my_llseek(unsigned int fd, unsigned long long offset, unsigned int origin) {
           long long result;
           int retval;
 
           retval = _llseek (fd, offset >> 32, offset & 0xffffffff,
                             &result, origin);
           return (retval == -1) ? -1 : result;
 }

This _syscall5 is a macro that expands to the definition of _llseek as system call, with a tiny wrapper to set errno if necessary and the routine my_llseek invokes this system call. Read the details in /usr/include/asm/unistd.h.

An alternative is to use the syscall() call. It allows one to invoke system calls by number. See syscall(2). For example, syscall(__NR_getuid) is equivalent to getuid():

 #include <stdio.h>
 #include <unistd.h>
 #include <sys/syscall.h>
 
 int main() {
         printf("%ld\n", syscall(__NR_getuid));
         return 0;
 }

3.3 Error return conventions

Typically, the kernel returns a negative value to indicate an error:

         return -EPERM;  /* Operation not permitted */

Typically, libc returns -1 to indicate an error, and sets the global variable errno to a value indicating what was wrong. Thus, one expects glueing code somewhat like

 int chdir(char *dir) {
         int res = sys_chdir(dir);
         if (res < 0 && res > -4096) {
                 errno = -res;
                 return -1;
         }
         return res;
 }

Such glueing code is automatically provided if one uses the kernel macros _syscallN, with N in 0..6 (for system calls with N parameters), defined in /usr/include/asm/unistd.h. These macros all end with a call of the macro __syscall_return defined for 2.6.14 as

 #define __syscall_return(type, res) \
 do { \
         if ((unsigned long)(res) >= (unsigned long)(-(128 + 1))) { \
                 errno = -(res); \
                 res = -1; \
         } \
         return (type) (res); \
 } while (0)

that is supposed to convert the error codes to -1. (The code is buggy: the -(128+1) and the accompanying comment suggest that this is a test for 1..128, but it is a test for 1..129, and in fact error numbers fill the range 1..131 for this kernel.)

Most system calls return positive values, but some can return arbitrary values, and it is impossible to see whether an error occurred.

3.4 Alternative C libraries

The canonical C library under Linux is glibc.

 % /lib/libc.so.6
 GNU C Library stable release version 2.2.5, by Roland McGrath et al.
 ...

But several other C libraries exist. In ancient times we had libc4, a library still used by the Mastodon distribution. It uses the a.out format for binaries instead of the newer ELF format.

In old times we had libc5. It is not much used anymore, but people sometimes like it because it is much smaller than glibc.

But these days we have a handful of other small libc's: uClibc, dietlibc, newlib, klibc.

The last one is still in an early stage of development. It is intended for early user space (see also below), when the Linux kernel has been booted but no filesystem has been found on disk yet. Thus, it should be tiny, and needs only a few functions.

All of the libraries uClibc, dietlibc and newlib are meant for embedded use. Especially uClibc is fairly complete. They are much smaller than glibc.

There are also projects to recreate all standard utilities in a minimal form. See, for example busybox.

Exercise Install and use uClibc. What is the difference in size compared to glibc of statically compiled binaries? Of dynamically compiled binaries? Is there a speed difference?

3.5 Initial userspace

In 2.5.46 the first version of early userspace was merged into the official kernel tree. One sees the effects mainly in the dmesg output

 -> /dev
 -> /dev/console
 -> /root

The subdirectory /usr of the kernel source tree is for early userspace stuff. In init/main.c there is the call populate_rootfs() that unpacks the cpio archive initramfs_data.cpio.gz into the kernel rootfs. There are only these three device nodes, no actual programs yet, since programs need a library, and klibc has not been merged yet.

For recent news, see initrd-dynamic (an alternative), patch1 (the first code to get merged), klibc.

If one wants to play with this, find kernel 2.5.64 and apply the above klibc patch. It contains a usr/root/hello.c program, that however is never invoked. Add an invocation in init/main.c before the prepare_namespace(), e.g.,

         unlock_kernel();
         system_running = 1;
 
         execve("/sbin/hello", argv_init, envp_init);
 
         prepare_namespace();
         if (open("/dev/console", O_RDWR, 0) < 0) ...

Make sure to give the program the right name, e.g,

         cpio_mkfile("usr/root/hello", "/sbin/hello", 0777, 0, 0);

in gen_init_cpio.c, the program used to generate initramfs_data.cpio.gz. (Of course one can also generate this cpio archive using cpio, and give it arbitrary contents. If the contents is large, a patch is needed that is first present in 2.5.65.)

If after booting you see Hi Ma! then hello.c executed successfully, but the boot will stop there, since this process is "init" and must never exit. (So, the "hello" program should do its work and then exec the real init.)

Note that the / here is in the initramfs filesystem. But after prepare_namespace() one has / in the root filesystem.

3.6 Libraries and binary formats

The binary files one meets in daily life are object files, executables and libraries.

Given the standard example hello.c, that contains something like

 main() { printf("Hello!\n"); }

one creates the object file hello.o by cc -c hello.c, or the executable hello by cc -o hello hello.c. Now this executable does not contain code for the printf() function. The command

 % ldd hello
         linux-gate.so.1 =>  (0xffffe000)
         libc.so.6 => /lib/tls/libc.so.6 (0x40036000)
         /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

shows that this executable requires ld-linux.so.2 and libc.so.6 at run time. The former is a linker that will at startup time insert the address of the printf() routine (found in libc.so.6) into a table with function pointers. (For linux-gate.so.1, see the section on vsyscalls.) See also ld.so(8).

It is possible to produce complete executables, that do not require run-time linking by giving cc the -static flag: cc -static -o hello-static hello.c.

 % ldd hello-static
         not a dynamic executable
 % ls -l hello* | sort -n +4
 -rw-r--r--  1 aeb users      31 2004-10-16 13:40 hello.c
 -rw-r--r--  1 aeb users     848 2004-10-16 13:40 hello.o
 -rwxr-xr-x  1 aeb users    8647 2004-10-16 13:40 hello
 -rwxr-xr-x  1 aeb users 2189142 2004-10-16 13:40 hello-static
 % strip hello hello-static
 % ls -l hello hello-static
 -rwxr-xr-x  1 aeb users   2952 2004-10-16 13:41 hello
 -rwxr-xr-x  1 aeb users 388108 2004-10-16 13:41 hello-static

The strip utility removes the symbol table. Static executables are huge and usually needed only in emergency situations. For example, it is common to have a statically linked version sln of the ln utility, to set up links like /lib/libc.so.6 -> libc-2.3.2.so making the library name point at the actual library. (Changing such links should be done with ln -sf .., so that there never is a moment that the libc link points to nowhere. If one tries to go in two steps: remove old link, create new link, then the second step will fail with an ln that needs libc, and suddenly no command works anymore.) It is also common to have a statically linked /sbin/init.

Binary formats

Various binary formats exist, like a.out, COFF, and ELF. ELF is the modern format. Support for a.out is disappearing.

The linux libc4 (like libc.so.4.6.27) libraries use the a.out format. In 1995 the big changeover to ELF happened. The new libraries are called libc5 (like libc.so.5.0.9). Around 1997/1998 libc5, maintained by HJLu, was replaced by libc6, also known as glibc2, maintained by Ulrich Drepper.

The a.out format comes in several flavours, such as OMAGIC, NMAGIC, ZMAGIC, QMAGIC. The OMAGIC format is very compact, but program in this format cannot be swapped or demand paged because it has a non-page-aligned header. The ZMAGIC format has its .text section aligned to a 1024-byte boundary, allowing bmap(), provided the binary lives on a filesystem with 1024-byte blocks. It was superseded by the QMAGIC format, that has its .text section starting at offset 0 (so that it contains the header) but with the first page not mapped. The result is that QMAGIC binaries are 992 bytes smaller than ZMAGIC ones, and moreover allow one to trap dereference of NULL pointers.

The binary format of an executable must be understood by the kernel when it handles an exec() call. There are kernel configuration options CONFIG_BINFMT_AOUT, CONFIG_BINFMT_MISC, CONFIG_BINFMT_ELF etc. Support for certain types of binaries can also be compiled as a module.

The kernel has (in exec.c) a linked list formats and routines (un)register_binfmt() called by modules who want to announce that they support some binary format. The routine search_binary_handler() tries all of the registered modules, calling their load_binary() functions one by one until one returns success. If all fail, the first few bytes of the binary are used as a magic (decimal) number to request a module that was not loaded yet. For example, a ZMAGIC binary starts with the bytes 0b 01, giving octal 0413, decimal 267 and would cause the module binfmt-267 to be requested. (Details depend on kernel version.)

Example of an OMAGIC file

Let us create a small OMAGIC binary by hand. We need a 32-byte header with structure as given in <asm/a.out.h>.

 % cat exb.s
 ; aout header
 .LONG 0407      ; OMAGIC
 .LONG 12        ; length of text
 .LONG 0         ; length of data
 .LONG 0         ; length of bss
 .LONG 0         ; length of symbol table
 .LONG 0         ; start address
 .LONG 0         ; length of relocation info for text
 .LONG 0         ; length of relocation info for data
 
 ; actual program
 main:
         mov eax,#1    ; exit(42)
         mov ebx,#42
         int #0x80
 % as86 -3 -b exb exb.s
 % od -Ad -tx1 exb
 0000 07 01 00 00 0c 00 00 00 00 00 00 00 00 00 00 00
 0016 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 0032 b8 01 00 00 00 bb 2a 00 00 00 cd 80
 0044
 % objdump -d exb
 
 exb:     file format a.out-i386-linux
 
 Disassembly of section .text:
 
 0000000000000000 <.text>:
    0:   b8 01 00 00 00          mov    $0x1,%eax
    5:   bb 2a 00 00 00          mov    $0x2a,%ebx
    a:   cd 80                   int    $0x80
 % chmod +x exb
 % ./exb
 % echo $?
 42
 %

Thus we have a 44-byte binary that consists of a 32-byte header followed by a 12-byte program. (The statically linked translation of the equivalent C program main() { return 42; } takes 388076 bytes.)

That worked. Now avoid creating the header by hand.

 % cat exa.s
 export _main
 _main:
         mov eax,#1
         mov ebx,#42
         int #0x80
 % as86 -3 -o exa.o exa.s
 % ld86 -N -s -o exa exa.o
 % ./exa
 % echo $?
 42
 %

(Without the export line, ld86 will complain ld86: no start symbol.)

That was (almost) the same 44-byte binary - one header byte differs.

Producing OMAGIC from C source seems to be done using bcc -3 -N, but I do not have bcc here.

Shared and static libraries

Linking against a static library (with a name like foo.a) involves copying the code for the functions needed from that library at compile time. Linking against a dynamic library (with a name like foo.sa for a.out, or foo.so for ELF) involves finding references to the functions needed at compile time, so that these can be found in the right libraries at run time. (The files foo.sa are not the actual libraries, but contain values of global symbols and function addresses needed for run time linking. The actual library is probably called foo.so.1.2.3.) The utility ldd tells you what libraries a program needs.

Personality

Linux has the concept of personality of an executable (since 1.1.20). The purpose is to make the Linux environment more similar to some other environment, like BSD or SCO or Solaris or older Linux, so that foreign or old binaries have better chances of working without modification.

For example, the Linux select() system call will update its timeout parameter to reflect the amount of time left. With the STICKY_TIMEOUTS flag set in the personality this is not done, following BSD behaviour.

For example, dereferencing a NULL pointer is sign of a program bug and causes a segfault because 0 is not in mapped memory. But SVr4 maps page 0 read-only (filled with zeroes) and some programs depend on this. With the MMAP_PAGE_ZERO flag set in the personality Linux will also do this.

Or, for example, the Linux mmap() system call will nowadays randomize the assigned addresses as a defense against hacking attempts, but with the ADDR_NO_RANDOMIZE flag set in the personality this is not done (since 2.6.12).

The personality value is composed of a 2-byte value identifying the system (Linux, SVR4, SUNOS, HPUX etc), and a number of 1-bit flags. See <linux/personality.h>. The personality is inherited from the parent process, and changed using the personality() system call. The setarch utility is a convenient tool for starting a process with a given personality.

4. System Calls

4.1 System call numbers

System calls are identified by their numbers. The number of the call foo is __NR_foo. For example, the number of _llseek used above is __NR__llseek, defined as 140 in /usr/include/asm-i386/unistd.h. Different architectures have different numbers.

Often, the kernel routine that handles the call foo is called sys_foo. One finds the association between numbers and names in the sys_call_table, for example in arch/i386/kernel/entry.S.

Change

The world changes and system calls change. Since one must not break old binaries, the semantics associated to any given system call number must remain fully backwards compatible.

What happens in practice is one of two things: either one gets a new and improved system call with a new name and number, and the libc routine that used to invoke the old call is changed to use the new one, or the new call (with new number) gets the old name, and the old call gets "old" prefixed to its name.

For example, long ago user IDs had 16 bits, today they have 32. __NR_getuid is 24, and __NR_getuid32 is 199, and the former belongs to the 16-bit version of the call, the latter to the 32-bit version. Looking at the associated kernel routines, we find that these are sys_getuid16 and sys_getuid, respectively. (Thus, sys_getuid does not have number __NR_getuid.) Looking at glibc, we find code somewhat like

 int getuid32_available = UNKNOWN;
 
 uid_t getuid(void) {
         if (getuid32_available == TRUE)
                 return INLINE_SYSCALL(getuid32, 0);
         if (getuid32_available == UNKNOWN) {
                 uid_t res = INLINE_SYSCALL(getuid32, 0);
 
                 if (res == 0 || errno != ENOSYS) {
                         getuid32_available = TRUE;
                         return res;
                 }
                 getuid32_available = FALSE;
         }
         return INLINE_SYSCALL(getuid, 0);
 }

For an example where the name was moved and the old call got a name prefixed by "old", see __NR_oldolduname, __NR_olduname, __NR_uname, belonging to sys_olduname, sys_uname, sys_newuname, respectively. One also has __NR_oldstat, __NR_stat, __NR_stat64 belonging to sys_stat, sys_newstat, sys_stat64, respectively. And __NR_umount, __NR_umount2 belonging to sys_oldumount, sys_umount, respectively.

These moving names are confusing - now you have been warned: the system call with number __NR_foo does not always belong to the kernel routine sys_foo().

4.2 The call

What happens? The assembler for a call with 0 parameters (on i386) is

 #define _syscall0(type,name) \
 type name(void) \
 { \
 long __res; \
 __asm__ volatile ("int $0x80" \
         : "=a" (__res) \
         : "0" (__NR_##name)); \
 __syscall_return(type,__res); \
 }

Thus, the basic ingredient is the assembler instruction INT 0x80. This causes a programmed exception and calls the kernel system_call routine. Some relevant code fragments:

 /* include/asm-i386/hw_irq.h */
 #define SYSCALL_VECTOR              0x80
 
 /* arch/i386/kernel/traps.c */
         set_system_gate(SYSCALL_VECTOR,&system_call);
 
 /* arch/i386/kernel/entry.S */
 #define GET_CURRENT(reg) \
         movl $-8192, reg; \
         andl %esp, reg
 
 #define SAVE_ALL \
         cld; \
         pushl %es; \
         pushl %ds; \
         pushl %eax; \
         pushl %ebp; \
         pushl %edi; \
         pushl %esi; \
         pushl %edx; \
         pushl %ecx; \
         pushl %ebx; \
         movl $(__KERNEL_DS),%edx; \
         movl %edx,%ds; \
         movl %edx,%es;
 
 #define RESTORE_ALL     \
         popl %ebx;      \
         popl %ecx;      \
         popl %edx;      \
         popl %esi;      \
         popl %edi;      \
         popl %ebp;      \
         popl %eax;      \
 1:      popl %ds;       \
 2:      popl %es;       \
         addl $4,%esp;   \
 3:      iret;
 
 ENTRY(system_call)
         pushl %eax                      # save orig_eax
         SAVE_ALL
         GET_CURRENT(%ebx)
         testb $0x02,tsk_ptrace(%ebx)    # PT_TRACESYS
         jne tracesys
         cmpl $(NR_syscalls),%eax
         jae badsys
         call *SYMBOL_NAME(sys_call_table)(,%eax,4)
         movl %eax,EAX(%esp)             # save the return value
 
 ENTRY(ret_from_sys_call)
         cli                             # need_resched and signals atomic test
         cmpl $0,need_resched(%ebx)
         jne reschedule
         cmpl $0,sigpending(%ebx)
         jne signal_return
         RESTORE_ALL

We transfer execution to system_call, save the original value of the EAX register (it is the number of the system call), save all other registers, verify that we are not being traced (otherwise the tracer must be informed and entirely different things happen), make sure that the system call number is within range, and call the appropriate kernel routine from the table sys_call_table. Upon return we check a few things and when all is well restore the registers and call IRET to return from this INT.

(This was for the i386 architecture. All details differ on other architectures, but the basic idea is the same: store the syscall number and the syscall parameters somewhere the kernel can find them, in registers, on the stack, or in a known place of memory, do something that causes a transfer to kernel code, etc.)

4.3 System call parameters

On i386, the parameters of a system call are transported via registers. The system call number goes into %eax, the first parameter in %ebx, the second in %ecx, the third in %edx, the fourth in %esi, the fifth in %edi, the sixth in %ebp.

Ancient history

Earlier versions of Linux could handle only four or five system call parameters, and therefore the system calls select() (5 parameters) and mmap() (6 parameters) used to have a single parameter that was a pointer to a parameter block in memory. Since Linux 1.3.0 five parameters are supported (and the earlier select with memory block was renamed old_select), and since Linux 2.3.31 six parameters are supported (and the earlier mmap with memory block was succeeded by the new mmap2).

4.4 Error return

Above we said: typically, the kernel returns a negative value to indicate an error. But this would mean that any system call only can return positive values. Since the negative error returns are of the form -ESOMETHING, and the error numbers have small positive values, there is only a small negative error range. Thus

 #define __syscall_return(type, res) \
 do { \
         if ((unsigned long)(res) >= (unsigned long)(-125)) { \
                 errno = -(res); \
                 res = -1; \
         } \
         return (type) (res); \
 } while (0)

Here the range [-125,-1] is reserved for errors (the constant 125 is version and architecture dependent) and other values are OK.

What if a system call wants to return a small negative number and it is not an error? The scheduling priority of a process is set by setpriority() and read by getpriority(), and this value ranges from -20 (top priority) to 19 (lowest priority background job). The library routines with these names use these numbers, but the system call getpriority() returns 20 - P instead of P, moving the output interval to positive numbers only.

Or, similarly, the subfunctions PEEK* of ptrace return the contents of a memory word in the traced process, and any value is possible. However, the system call returns this value in the data argument, and glibc does something like

         res = sys_ptrace(request, pid, addr, &data);
         if (res >= 0) {
                 errno = 0;
                 res = data;
         }
         return res;

so that a user program has to do

         errno = 0;
         res = ptrace(PTRACE_PEEKDATA, pid, addr, NULL);
         if (res == -1 && errno != 0)
                 /* error */

4.5 Interrupted system calls

Above we saw in ret_from_sys_call the test on sigpending: if a signal arrived while we were executing kernel code, then just before returning from the system call we first call the user program's signal handler, and when this finishes return from the system call.

When a system call is slow and a signal arrives while it was blocked, waiting for something, the call is aborted and returns -EINTR, so that the library function will return -1 and set errno to EINTR. Just before the system call returns, the user program's signal handler is called.

(So, what is "slow"? Mostly those calls that can block forever waiting for external events; read and write to terminal devices, but not read and write to disk devices, wait, pause.)

This means that a system call can return an error while nothing was wrong. Usually one will want to redo the system call. That can be automated by installing the signal handler using a call to sigaction with the SA_RESTART flag set. The effect is that upon an interrupt the system call is aborted, the user program's signal handler is called, and afterwards the system call is restarted from the beginning.

Why is this not the default? It was, for a while, but often it is necessary to react to a signal while the reacting is not done by the signal handler itself. It is difficult to do nontrivial things in a signal handler since the rest of the program is in an unknown state, and most signal handlers just set a flag that is tested elsewhere.

A demo:

 #include <stdio.h>
 #include <errno.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <signal.h>
 
 int got_interrupt;
 
 void intrup(int dummy) {
         got_interrupt = 1;
 }
 
 void die(char *s) {
         printf("%s\n", s);
         exit(1);
 }
 
 int main() {
         struct sigaction sa;
         int n;
         char c;
 
         sa.sa_handler = intrup;
         sigemptyset(&sa.sa_mask);
         sa.sa_flags = 0;
         if (sigaction(SIGINT, &sa, NULL))
                 die("sigaction-SIGINT");
         sa.sa_flags = SA_RESTART;
         if (sigaction(SIGQUIT, &sa, NULL))
                 die("sigaction-SIGQUIT");
 
         got_interrupt = 0;
         n = read(0, &c, 1);
         if (n == -1 && errno == EINTR)
                 printf("read call was interrupted\n");
         else if (got_interrupt)
                 printf("read call was restarted\n");
 
         return 0;
 }

Here Ctrl-C will interrupt the read call, while after Ctrl-\ the read call is restarted.

4.6 Sysenter and the vsyscall page

It has been observed that a 2 GHz Pentium 4 was much slower than an 850 MHz Pentium III on certain tasks, and that this slowness is caused by the very large overhead of the traditional int 0x80 interrupt on a Pentium 4.

Some models of the i386 family do have faster ways to enter the kernel. On Pentium II there is the sysenter instruction. Also AMD has a syscall instruction. It would be good if these could be used.

Something else is that in some applications gettimeofday() is a done very often, for example for timestamping all transactions. It would be nice if it could be implemented with very low overhead.

One way of obtaining a fast gettimeofday() is by writing the current time in a fixed place, on a page mapped into the memory of all applications, and updating this location on each clock interrupt. These applications could then read this fixed location with a single instruction - no system call required.

There might be other data that the kernel could make available in a read-only way to the process, like perhaps the current process ID. A vsyscall is a "system" call that avoids crossing the userspace-kernel boundary.

Linux is in the process of implementing such ideas. Since Linux 2.5.53 there is a fixed page, called the vsyscall page, filled by the kernel. At kernel initialization time the routine sysenter_setup() is called. It sets up a non-writable page and writes code for the sysenter instruction if the CPU supports that, and for the classical int 0x80 otherwise. Thus, the C library can use the fastest type of system call by jumping to a fixed address in the vsyscall page.

Concerning gettimeofday(), a vsyscall version for the x86-64 is already part of the vanilla kernel. Patches for i386 exist. (An example of the kind of timing differences: John Stultz reports on an experiment where he measures gettimeofday() and finds 1.67 us for the int 0x80 way, 1.24 us for the sysenter way, and 0.88 us for the vsyscall.)

Some details

The kernel maps a page (0xffffe000-0xffffefff) in the memory of every process. (This is the one but last addressable page. The last is not mapped - maybe to avoid bugs related to wraparound.) We can read it:

 /* get vsyscall page */
 #include <unistd.h>
 #include <string.h>
 
 int main() {
         char *p = (char *) 0xffffe000;
         char buf[4096];
 #if 0
         write(1, p, 4096);
         /* this gives EFAULT */
 #else
         memcpy(buf, p, 4096);
         write(1, buf, 4096);
 #endif
         return 0;
 }

and if we do, find an ELF binary.

 % ./get_vsyscall_page > syspage
 % file syspage
 syspage: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), stripped
 % objdump -h syspage
 
 syspage:     file format elf32-i386
 
 Sections:
 Idx Name          Size      VMA       LMA       File off  Algn
   0 .hash         00000050  ffffe094  ffffe094  00000094  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   1 .dynsym       000000f0  ffffe0e4  ffffe0e4  000000e4  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   2 .dynstr       00000056  ffffe1d4  ffffe1d4  000001d4  2**0
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   3 .gnu.version  0000001e  ffffe22a  ffffe22a  0000022a  2**1
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   4 .gnu.version_d 00000038  ffffe248  ffffe248  00000248  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   5 .text         00000047  ffffe400  ffffe400  00000400  2**5
                   CONTENTS, ALLOC, LOAD, READONLY, CODE
   6 .eh_frame_hdr 00000024  ffffe448  ffffe448  00000448  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   7 .eh_frame     0000010c  ffffe46c  ffffe46c  0000046c  2**2
                   CONTENTS, ALLOC, LOAD, READONLY, DATA
   8 .dynamic      00000078  ffffe578  ffffe578  00000578  2**2
                   CONTENTS, ALLOC, LOAD, DATA
   9 .useless      0000000c  ffffe5f0  ffffe5f0  000005f0  2**2
                   CONTENTS, ALLOC, LOAD, DATA
 % objdump -d syspage
 
 syspage:     file format elf32-i386
 
 Disassembly of section .text:
 
 ffffe400 <.text>:
 ffffe400:       51                      push   %ecx
 ffffe401:       52                      push   %edx
 ffffe402:       55                      push   %ebp
 ffffe403:       89 e5                   mov    %esp,%ebp
 ffffe405:       0f 34                   sysenter 
 ffffe407:       90                      nop    
 ffffe408:       90                      nop    
         ... more nops ...
 ffffe40d:       90                      nop    
 ffffe40e:       eb f3                   jmp    0xffffe403
 ffffe410:       5d                      pop    %ebp
 ffffe411:       5a                      pop    %edx
 ffffe412:       59                      pop    %ecx
 ffffe413:       c3                      ret    
         ... zero bytes ...
 ffffe420:       58                      pop    %eax
 ffffe421:       b8 77 00 00 00          mov    $0x77,%eax
 ffffe426:       cd 80                   int    $0x80
 ffffe428:       90                      nop    
 ffffe429:       90                      nop    
         ... more nops ...
 ffffe43f:       90                      nop    
 ffffe440:       b8 ad 00 00 00          mov    $0xad,%eax
 ffffe445:       cd 80                   int    $0x80

The interesting addresses here are found via

 % grep ffffe System.map 
 ffffe000 A VSYSCALL_BASE
 ffffe400 A __kernel_vsyscall
 ffffe410 A SYSENTER_RETURN
 ffffe420 A __kernel_sigreturn
 ffffe440 A __kernel_rt_sigreturn
 %

So __kernel_vsyscall pushes a few registers and does a sysenter instruction. And SYSENTER_RETURN pops the registers again and returns. And __kernel_sigreturn and __kernel_rt_sigreturn do system calls 119 and 173, that is, sigreturn and rt_sigreturn, respectively.

What about the jump just before SYSENTER_RETURN? It is a trick to handle restarting of system calls with 6 parameters. As Linus said: I'm a disgusting pig, and proud of it to boot.

The code involved is most easily seen from a slightly earlier patch.

A tiny demo program.

 #include <stdio.h>
 
 int pid;
 
 int main() {
         __asm__(
                 "movl $20, %eax    \n"
                 "call 0xffffe400   \n"
                 "movl %eax, pid    \n"
         );
         printf("pid is %d\n", pid);
         return 0;
 }

This does the getpid() system call (__NR_getpid is 20) using call 0xffffe400 instead of int 0x80.

However, the proper thing to do is not call 0xffffe400 but call *%gs:0x18. If %gs has been set up so that it addresses 0xffffe000, then at location 0xffffe018 we find the value of __kernel_vsyscall, the entry point of the kernel vsyscalls. Such general setup requires the parsing of the ELF headers of this vsyscall page, but then is future-proof.

5. Signals

The common communication channel between user space program and kernel is given by the system calls. But there is a different channel, that of the signals, used both between user processes and from kernel to user process.

5.1 Sending signals

A program can signal a different program using the kill() system call with prototype

 int kill(pid_t pid, int sig);

This will send the signal with number sig to the process with process ID pid. Signal numbers are small positive integers. (For the definitions on your machine, try /usr/include/bits/signum.h. Note that these definitions depend on OS and architecture.)

A user can send a signal from the command line using the kill command. Common uses are kill -9 N to kill the process with pid N, or kill -1 N to force process N (maybe init or inetd) to reread its configuration file.

Certain user actions will make the kernel send a signal to a process or group of processes: typing the interrupt character (probably Ctrl-C) causes SIGINT to be sent, typing the quit character (probably Ctrl-\) sends SIGQUIT, hanging up the phone (modem) sends SIGHUP, typing the stop character (probably Ctrl-Z) sends SIGSTOP.

Certain program actions will make the kernel send a signal to that process: for an illegal instruction one gets SIGILL, for accessing nonexisting memory one gets SIGSEGV, for writing to a pipe while nobody is listening anymore on the other side one gets SIGPIPE, for reading from the termnal while in the background one gets SIGTTIN, etc.

More interesting communication from the kernel is also possible. One can ask the kernel to be notified when something happens on a given file descriptor. See fcntl(2). And then there is ptrace(2) - see below.

A whole group of signals is reserved for real-time use.

5.2 Receiving signals

When a process receives a signal, a default action happens, unless the process has arranged to handle the signal. For the list of signals and the corresponding default actions, see signal(7). For example, by default SIGHUP, SIGINT, SIGKILL will kill the process; SIGQUIT will kill the process and force a core dump; SIGSTOP, SIGTTIN will stop the process; SIGCONT will continue a stopped process; SIGCHLD will be ignored.

Traditionally, one sets up a handler for the signal using the signal system call with prototype

 typedef void (*sighandler_t)(int);
 sighandler_t signal(int sig, sighandler_t handler);

This sets up the routine handler() as handler for signals with number sig. The return value is (the address of) the old handler. The special values SIG_DFL and SIG_IGN denote the default action and ignoring, respectively.

When a signal arrives, the process is interrupted, the current registers are saved, and the signal handler is invoked. When the signal handler returns, the interrupted activity is continued.

It is difficult to do interesting things in a signal handler, because the process can be interrupted in an arbitrary place, data structures can be in arbitrary state, etc. The three most common things to do in a signal handler are (i) set a flag variable and return immediately, and (ii) (messy) throw away all the program was doing, and restart at some convenient point, perhaps the main command loop or so, and (iii) clean up and exit.

Setting up a handler for a signal is called "catching the signal". The signals SIGKILL and SIGSTOP cannot be caught or blocked or ignored.

5.3 Semantics

The traditional semantics was: reset signal behaviour to SIG_DFL upon invocation of the signal handler. Possibly this was done to avoid recursive invocations. The signal handler would do its job and at the end call signal() to establish itself again as handler.

This is really unfortunate. When two signals arrive shortly after each other, the second one will be lost if it arrives before the signal handler is called - there is no counter. And if it arrives after the signal handler is called, the default action will happen - this may very well kill the process. Even if the handler calls signal() again as the very first thing it does, that may be too late.

Various Unix flavours played a bit with the semantics to improve on this situation. Some block signals as long as the process has not returned from the handler. The BSD solution was to invent a new system call, sigaction() where one can precisely specify the desired behaviour. Today signal() must be regarded as deprecated - not to be used in serious applications.

5.4 Blocking signals

Each process has a list (bitmask) of currently blocked signals. When a signal is blocked, it is not delivered (that is, no signal handling routine is called), but remains pending.

The sigprocmask() system call serves to change the list of blocked signals. See sigprocmask(2).

The sigpending() system call reveals what signals are (blocked and) pending.

The sigsuspend() system call suspends the calling process until a specified signal is received.

When a signal is blocked, it remains pending, even when otherwise the process would ignore it.

5.5 Voodoo: wait and SIGCHLD

When a process forks off a child to perform some task, it is probably interested in how things went. Upon exit, the child leaves an exit status that should be returned to the parent. So, when the child finishes it becomes a zombie - a process that is dead already but does not disappear yet because it has not yet reported its exit status.

Whenever something interesting happens to the child (it exits, crashes, traps, stops, continues), and in particular when it dies, the parent is sent a SIGCHLD signal.

The parent can use the system call wait() or waitpid() or so, there are a few variations, to learn about the status of its stopped or deceased children. In the case of a deceased child, as soon as a status has been reported, the zombie vanishes.

If the parent is not interested it can say so explicitly (before the fork) using

 signal(SIGCHLD, SIG_IGN);

 struct sigaction act;
 act.sa_handler = something;
 act.sa_flags = SA_NOCLDWAIT;
 sigaction (SIGCHLD, &act, NULL);

and as a result it will not hear about deceased children, and children will not be transformed into zombies. Note that the default action for SIGCHLD is to ignore this signal; nevertheless signal(SIGCHLD, SIG_IGN) has effect, namely that of preventing the transformation of children into zombies. In this situation, if the parent does a wait(), this call will return only when all children have exited, and then returns -1 with errno set to ECHILD.

It depends on the Unix flavor whether SIGCHLD is sent when SA_NOCLDWAIT was set. After act.sa_flags = SA_NOCLDSTOP no SIGCHLD is sent when children stop or stopped children continue.

If the parent exits before the child, then the child is reparented to init, process 1, and this process will reap its status.

5.6 Returning from a signal handler

When the program was interrupted by a signal, its status (including all integer and floating point registers) was saved, to be restored just before execution continues at the point of interruption.

This means that the return from the signal handler is more complicated than an arbitrary procedure return - the saved state must be restored.

To this end, the kernel arranges that the return from the signal handler causes a jump to a short code sequence (sometimes called trampoline) that executes a sigreturn() system call. This system call takes care of everything.

In the old days the trampoline lived on the stack, but nowadays (since 2.5.69) we have a trampoline in the vsyscall page, so that this trampoline no longer is an obstacle in case one wants a non-executable stack.

5.7 ptrace

For debugging purposes, the ptrace() system call was introduced. A process can trace a different process, examine or change its memory, see the system calls done or change them, etc. The way this is implemented is that the tracing process is notified each time the traced process does something interesting. Always interesting is the reception of signals. When the tracing process specifies this, also excution of system calls, or execution of any instruction is interesting.

Thus, one has interactive debuggers like gdb, and tracers like strace. This call is also very useful for hacking purposes. One can make one's own program attach to some utility and subtly change its workings, while the binary of the utility is unchanged, still has the correct date stamps and md5sum.

Exercise Write a program that attaches itself to a process with specified pid, and watches its read() calls; whenever a specified string occurs, change it into something else. Be otherwise invisible.

Exercise Write a program that attaches itself to a shell and watches the commands given. Whenever a specified binary must be executed, execute a different specified binary instead. Be otherwise invisible.

Below a baby example of the use of ptrace. This program will list the system calls done by some existing process (call ./ptrace -p N) or some subprocess (call ./ptrace some_command, e.g., ./ptrace /bin/ls -l). The version below will work only on i386, and only for relatively recent kernels.

 /*
  * ptrace a child - baby demo example - i386 only
  *
  * call: "ptrace command args" to trace command
  *       "ptrace -p N" to trace process with pid N
  */
 #include <errno.h>
 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <signal.h>
 #include <sys/wait.h>
 #include <sys/types.h>
 #include <sys/ptrace.h>
 #include <linux/user.h>      /* ORIG_EAX */
 #include <linux/unistd.h>    /* __NR_exit */
 #include <linux/ptrace.h>    /* EAX */
 
 #define SIZE(a) (sizeof(a)/sizeof((a)[0]))
 
 /*
  * syscall names for i386 under 2.5.51, taken from <asm/unistd.h>
  */
 char *(syscall_names[256]) = {
 "exit", "fork", "read", "write", "open", "close", "waitpid", "creat",
 "link", "unlink", "execve", "chdir", "time", "mknod", "chmod",
 "lchown", "break", "oldstat", "lseek", "getpid", "mount", "umount",
 "setuid", "getuid", "stime", "ptrace", "alarm", "oldfstat", "pause",
 "utime", "stty", "gtty", "access", "nice", "ftime", "sync", "kill",
 "rename", "mkdir", "rmdir", "dup", "pipe", "times", "prof", "brk",
 "setgid", "getgid", "signal", "geteuid", "getegid", "acct", "umount2",
 "lock", "ioctl", "fcntl", "mpx", "setpgid", "ulimit", "oldolduname",
 "umask", "chroot", "ustat", "dup2", "getppid", "getpgrp", "setsid",
 "sigaction", "sgetmask", "ssetmask", "setreuid", "setregid",
 "sigsuspend", "sigpending", "sethostname", "setrlimit", "getrlimit",
 "getrusage", "gettimeofday", "settimeofday", "getgroups", "setgroups",
 "select", "symlink", "oldlstat", "readlink", "uselib", "swapon",
 "reboot", "readdir", "mmap", "munmap", "truncate", "ftruncate",
 "fchmod", "fchown", "getpriority", "setpriority", "profil", "statfs",
 "fstatfs", "ioperm", "socketcall", "syslog", "setitimer", "getitimer",
 "stat", "lstat", "fstat", "olduname", "iopl", "vhangup", "idle",
 "vm86old", "wait4", "swapoff", "sysinfo", "ipc", "fsync", "sigreturn",
 "clone", "setdomainname", "uname", "modify_ldt", "adjtimex",
 "mprotect", "sigprocmask", "create_module", "init_module",
 "delete_module", "get_kernel_syms", "quotactl", "getpgid", "fchdir",
 "bdflush", "sysfs", "personality", "afs_syscall", "setfsuid",
 "setfsgid", "_llseek", "getdents", "_newselect", "flock", "msync",
 "readv", "writev", "getsid", "fdatasync", "_sysctl", "mlock",
 "munlock", "mlockall", "munlockall", "sched_setparam",
 "sched_getparam", "sched_setscheduler", "sched_getscheduler",
 "sched_yield", "sched_get_priority_max", "sched_get_priority_min",
 "sched_rr_get_interval", "nanosleep", "mremap", "setresuid",
 "getresuid", "vm86", "query_module", "poll", "nfsservctl",
 "setresgid", "getresgid", "prctl","rt_sigreturn","rt_sigaction",
 "rt_sigprocmask", "rt_sigpending", "rt_sigtimedwait",
 "rt_sigqueueinfo", "rt_sigsuspend", "pread", "pwrite", "chown",
 "getcwd", "capget", "capset", "sigaltstack", "sendfile", "getpmsg",
 "putpmsg", "vfork", "ugetrlimit", "mmap2", "truncate64",
 "ftruncate64", "stat64", "lstat64", "fstat64", "lchown32", "getuid32",
 "getgid32", "geteuid32", "getegid32", "setreuid32", "setregid32",
 "getgroups32", "setgroups32", "fchown32", "setresuid32",
 "getresuid32", "setresgid32", "getresgid32", "chown32", "setuid32",
 "setgid32", "setfsuid32", "setfsgid32", "pivot_root", "mincore",
 "madvise", "getdents64", "fcntl64", 0, "security", "gettid",
 "readahead", "setxattr", "lsetxattr", "fsetxattr", "getxattr",
 "lgetxattr", "fgetxattr", "listxattr", "llistxattr", "flistxattr",
 "removexattr", "lremovexattr", "fremovexattr", "tkill", "sendfile64",
 "futex", "sched_setaffinity", "sched_getaffinity",
 };
 
 void my_ptrace_void(int request, pid_t pid, void *addr, void *data) {
         int i = ptrace(request, pid, addr, data);
         if (i) {
                 perror("ptrace");
                 exit(1);
         }
 }
 
 /*
  * Since -1 may be valid data, we have to check errno.
  */
 int my_ptrace_read(int request, pid_t pid, void *addr, void *data) {
         int i;
 
         errno = 0;
         i = ptrace(request, pid, addr, data);
         if (i == -1 && errno) {
                 perror("ptrace");
                 exit(1);
         }
         return i;
 }
 
 pid_t pid;      /* the traced program */
 
 /* detach from traced program when interrupted */
 void interrupt(int dummy) {
         ptrace(PTRACE_DETACH, pid, 0, 0);
         exit(-1);
 }
 
 int got_sig = 0;
 
 void sigusr1(int dummy) {
         got_sig = 1;
 }
 
 void my_kill(pid_t pid, int sig) {
         int i = kill(pid, sig);
         if (i) {
                 perror("kill");
                 exit(1);
         }
 }
 
 /*
  * A child stopped at a syscall has status as if it received SIGTRAP.
  * In order to distinguish between SIGTRAP and syscall, some kernel
  * versions have the PTRACE_O_TRACESYSGOOD option, that sets an extra
  * bit 0x80 in the syscall case.
  */
 #define SIGSYSTRAP      (SIGTRAP | sysgood_bit)
 
 int sysgood_bit = 0;
 
 void set_sysgood(pid_t p) {
 #ifdef PTRACE_O_TRACESYSGOOD
         int i = ptrace(PTRACE_SETOPTIONS, p, 0, (void*) PTRACE_O_TRACESYSGOOD);
         if (i == 0)
                 sysgood_bit = 0x80;
         else
                 perror("PTRACE_O_TRACESYSGOOD");
 #endif
 }
 
 #define EXPECT_EXITED   1
 #define EXPECT_SIGNALED 2
 #define EXPECT_STOPPED  4
 
 void my_wait(pid_t p, int report, int stopsig) {
         int status;
         pid_t pw = wait(&status);
 
         if (pw == (pid_t) -1) {
                 perror("wait");
                 exit(1);
         }
 
         /*
          * Report only unexpected things.
          *
          * The conditions WIFEXITED, WIFSIGNALED, WIFSTOPPED
          * are mutually exclusive:
          * WIFEXITED:  (status & 0x7f) == 0, WEXITSTATUS: top 8 bits
          * and now WCOREDUMP:  (status & 0x80) != 0
          * WIFSTOPPED: (status & 0xff) == 0x7f, WSTOPSIG: top 8 bits
          * WIFSIGNALED: all other cases, (status & 0x7f) is signal.
          */
         if (WIFEXITED(status) && !(report & EXPECT_EXITED))
                 fprintf(stderr, "child exited%s with status %d\n",
                         WCOREDUMP(status) ? " and dumped core" : "",
                         WEXITSTATUS(status));
         if (WIFSTOPPED(status) && !(report & EXPECT_STOPPED))
                 fprintf(stderr, "child stopped by signal %d\n",
                         WSTOPSIG(status));
         if (WIFSIGNALED(status) && !(report & EXPECT_SIGNALED))
                 fprintf(stderr, "child signalled by signal %d\n",
                         WTERMSIG(status));
 
         if (WIFSTOPPED(status) && WSTOPSIG(status) != stopsig) {
                 /* a different signal - send it on and wait */
                 fprintf(stderr, "Waited for signal %d, got %d\n",
                         stopsig, WSTOPSIG(status));
                 if ((WSTOPSIG(status) & 0x7f) == (stopsig & 0x7f))
                         return;
                 my_ptrace_void(PTRACE_SYSCALL, p, 0, (void*) WSTOPSIG(status));
                 return my_wait(p, report, stopsig);
         }
 
         if ((report & EXPECT_STOPPED) && !WIFSTOPPED(status)) {
                 fprintf(stderr, "Not stopped?\n");
                 exit(1);
         }
 }
 
 /*
  * print value when changed
  */
 void outlonghex(unsigned long old, unsigned long new) {
         if (old == new)
                 fprintf(stderr, "         ");
         else
                 fprintf(stderr, " %08lx", new);
 }
 
 int
 main(int argc, char **argv, char **envp){
         pid_t p0, p;
 
         if (argc <= 1) {
                 fprintf(stderr, "Usage: %s command args -or- %s -p pid\n",
                         argv[0], argv[0]);
                 exit(1);
         }
 
         if (argc >= 3 && !strcmp(argv[1], "-p")) {
                 pid = p = atoi(argv[2]);
 
                 signal(SIGINT, interrupt);
 
                 /*
                  * attach to specified process
                  */
                 my_ptrace_void(PTRACE_ATTACH, p, 0, 0);
                 my_wait(p, EXPECT_STOPPED, SIGSTOP);
                 set_sysgood(p);
 
                 /*
                  * we stopped the program in the middle of what it was doing
                  * continue it, and make it stop at the next syscall
                  */
                 my_ptrace_void(PTRACE_SYSCALL, p, 0, 0);
         } else {
                 void (*oldsig)(int);
 
                 /*
                  * fork off a child that executes the specified command
                  */
 
                 /*
                  * The parent process will send a signal to the child
                  * and do a wait() to wait until the child stops.
                  * If the signal arrives before the child has said
                  * PTRACE_TRACEME, then maybe the child is killed, or
                  * maybe the signal is ignored and we wait forever, or
                  * maybe the child is stopped but we are not tracing.
                  * So, let us arrange for the child to signal the parent
                  * when it has done the PTRACE_TRACEME.
                  */
 
                 /* prepare both parent and child for signal */
                 oldsig = signal(SIGUSR1, sigusr1);
                 if (oldsig == SIG_ERR) {
                         perror("signal");
                         exit(1);
                 }
 
                 /* child needs parent pid */
                 p0 = getpid();
 
                 p = fork();
                 if (p == (pid_t) -1) {
                         perror("fork");
                         exit(1);
                 }
 
                 if (p == 0) {   /* child */
                         my_ptrace_void(PTRACE_TRACEME, 0, 0, 0);
 
                         /* tell parent that we are ready */
                         my_kill(p0, SIGUSR1);
 
                         /* wait for parent to start tracing us */
                         while (!got_sig) ;
 
                         /*
                          * the first thing the parent will see is
                          *  119: sigreturn - the return from the signal handler
                          */
 
                         /* exec the given process */
                         argv[argc] = 0;
                         execve(argv[1], argv+1, envp);
                         exit(1);
                 }
 
                 /* wait for child to get ready */
                 while (!got_sig) ;
 
                 /*
                  * tell child that we got the signal
                  * this kill() will stop the child
                  */
                 my_kill(p, SIGUSR1);
                 my_wait(p, EXPECT_STOPPED, SIGUSR1);
                 set_sysgood(p);
 
                 my_ptrace_void(PTRACE_SYSCALL, p, 0, (void *) SIGUSR1);
         }
 
         /*
          * trace the victim's syscalls
          */
         while (1) {
                 int syscall;
                 struct user_regs_struct u_in, u_out;
 
                 my_wait(p, EXPECT_STOPPED, SIGSYSTRAP);
 
                 my_ptrace_void(PTRACE_GETREGS, p, 0, &u_in);
                 syscall = u_in.orig_eax;
 
                 fprintf(stderr, "SYSCALL %3d:", syscall);
                 outlonghex(-38, u_in.eax);      /* seems constant */
                 fprintf(stderr, " %08lx %08lx %08lx",
                         u_in.ebx, u_in.ecx, u_in.edx);
                 if (syscall-1 >= 0 && syscall-1 < SIZE(syscall_names) &&
                     syscall_names[syscall-1])
                         fprintf(stderr, " /%s", syscall_names[syscall-1]);
                 fprintf(stderr, "\n");
 
                 if (syscall == __NR_execve) {
                         long *regs = 0; /* relative address 0 in user area */
                         long eax;
 
                         my_ptrace_void(PTRACE_SYSCALL, p, 0, 0);
                         my_wait(p, EXPECT_STOPPED, SIGSYSTRAP);
 
                         /*
                          * For a successful execve we get one more trap
                          * But was this call successful?
                          */
                         eax = my_ptrace_read(PTRACE_PEEKUSER, p, &regs[EAX],0);
                         if (eax == 0) {
                                 fprintf(stderr, "SYSCALL execve, once more\n");
 
                                 /* the syscall return - no "good" bit */
                                 my_ptrace_void(PTRACE_SYSCALL, p, 0, 0);
                                 my_wait(p, EXPECT_STOPPED, SIGTRAP);
                         }
                 } else {
                         /* wait for syscall return */
                         my_ptrace_void(PTRACE_SYSCALL, p, 0, 0);
                         if (syscall == __NR_exit ||
                             syscall == __NR_exit_group) {
                                 my_wait(p, EXPECT_EXITED, 0);
                                 exit(0);
                         }
                         my_wait(p, EXPECT_STOPPED, SIGSYSTRAP);
                 }
 
                 my_ptrace_void(PTRACE_GETREGS, p, 0, &u_out);
                 fprintf(stderr, " RETURN %3d:", syscall);
                 outlonghex(u_in.eax, u_out.eax);
                 outlonghex(u_in.ebx, u_out.ebx);
                 outlonghex(u_in.ecx, u_out.ecx);
                 outlonghex(u_in.edx, u_out.edx);
                 fprintf(stderr, "\n");
 
                 my_ptrace_void(PTRACE_SYSCALL, p, 0, 0);
         }
         return 0;
 }

5.8 The Linux "parent death" signal

For each process there is a variable pdeath_signal, that is initialized to 0 after fork() or clone(). It gives the signal that the process should get when its parent dies.

This variable can be set using

         prctl(PR_SET_PDEATHSIG, sig);

and read out using

         prctl(PR_GET_PDEATHSIG, &sig);

This construct can be used in thread libraries: when the manager thread dies, all threads managed by it should clean up and exit. E.g.:

         prctl(PR_SET_PDEATHSIG, SIGHUP);        /* set pdeath sig */
         if (getppid() == 1)                     /* parent died already? */
                 kill(getpid(), SIGHUP);

6. File names and files

6.1 The file hierarchy

File naming in a Unix-like system can be described from the user space point of view and from the kernel point of view. On a Unix-like system, a user or user program sees a single file hierarchy, rooted at /. A full pathname of a file, like /usr/games/lib/hack/scores describes the path from the root of the tree to the given file. The current working directory is a marked node in the tree. Pathnames not starting with / are relative with respect to the cwd. (Thus, if the cwd is /home/aeb, then bin/findgr refers to /home/aeb/bin/findgr and ../asm refers to /home/asm.) The name . (when initial in a pathname) refers to the cwd. The names . and .. relative to some directory D refer to D and the parent directory of D, respectively.

With the original Unix, the empty string was a name for the current working directory. POSIX made this illegal.

That was the user picture. That files live in different filesystems on different devices is to a first approximation invisible. (If one looks more closely, then an NFS-mounted filesystem does not quite have Unix semantics. FAT filesystems do not have Unix attributes like owner and permissions. Etc.)

If she has the appropriate capabilities, the user can modify the big file hierarchy by mounting or unmounting block devices. Given a Zip disk or a compact flash card, or a floppy, or a CD or so, a command like

 
 
 mount -t vfat /dev/sda4 /zip

will attach the file hierarchy found on /dev/sda4 to the node named /zip, so that after this call /zip/a refers to the file /a on this disk, and /zip itself refers to / on this disk. In particular, owner and permissions of /zip will in general change. (Formulated differently: no inodes change, but after the mount the name /zip names a different inode.) The directory /zip is called the mount point.

Note that if the mount point was a nonempty directory, the files below it are no longer accessible using full pathnames through the mount point: If there was a /zip/a before the mount command, then this file can no longer be accessed by this name after the command. On the other hand, if some user process had current working directory /zip or /zip/b, this file will still be accessible using the name ./a (or just a) and ../a, respectively.

We see that the current working directory really is an inode, not a string.

Altogether, the interpretation of a filename depends on the root directory, the current working directory, and the list of mounts. The current working directory is private for each process, and a process can change its current working directory using the chdir() system call.

In ancient Unix, the root directory was the same for all processes. However, 4.2 BSD made the root directory private for each process, just like the current working directory, and introduced the chroot() system call. The chroot system call serves to specify which directory is pointed at by the users' /. It is often used in security-conscious environments, say by an ftp server, where the purpose is to keep anonymous users inside the chroot jail.

This gives a new example of a name space with interesting topology: the chroot() system call does not change the current working directory, so . can now lie outside the tree rooted at /.

In all traditional Unix versions, the list of mounts is global. Plan9 introduced the idea of namespaces where each process could see an entirely different tree (or forest) because also the mounts are private for each process. Linux is slowly following this example. Currently a special version of the fork() system call will fork off a child with a private copy of the name space of its parent, so that subsequent chdir() or chroot() or mount() calls in one do not influence the other. More details later.

Filesystem character set

So far we have not given any thoughts to the question how a filename is coded. What sequence of bits or bytes is used for a file called "/bin/ls"?

ASCII

The classical Unix answer is that filenames are in ASCII, mostly because everything is in ASCII. The C statement

 printf("Hello world!\n");

uses a string of which the first byte is 'H' with value 0110 (octal), that is, 72 (decimal), that is, 0x48 (hexadecimal), that is 01001000 (binary). Inside the program it doesn't matter what the corresponding glyph looks like, but as soon as this value is printed we recognize the shape as that of an H.

If the string is used as a filename, it will be stored on disk in some way, such that a later lookup finds the same string again.

ISO 8859-1

Bytes have 8 bits, and when there no longer was any need to use the high order bit as (anti)parity bit, more code points became available, and Western Europe started using ISO 8859-1 (also called Latin-1), an extension of ASCII in which the high order half is used for national characters, so that the Swedes can have and the Danes rdgrd med flde. This is still the default character set on the WWW. But then in Central Europe other symbols were needed and ISO 8859-2 was defined, and the Russians invented KOI-8, and the Japanese Shift-JIS and the Chinese GB5, and Western Europe needed the Euro, etc. etc. The international situation became as bad as the situation in the English speaking world had been before ASCII, with hundreds of different codes.

Exercise The above recipe for "rdgrd med flde" is in ISO 8859-1. What happens if you tell your browser that it is ISO 8859-2? ISO 8859-15? What happens when you paste the text into an xterm?

Unicode

For a global character set the situation in Japan and China is worst. Both countries have a large number of characters - large dictionaries may give 50000 or more. If one byte suffices for 2⁸ = 256 symbols, two bytes suffice for 2¹⁶ = 65536. Enough for China, or for Japan, but not enough for both. However, the Japanese character set is derived from the Chinese one, and many characters are the same, or almost the same. A committee did the "unification" of both alphabets, and Unicode was born: a 16-bit character set with symbols for every character used in any of the world's languages. These days the whole world is slowly converting to Unicode. RedHat 8.0 is the first Linux distribution that uses Unicode by default.

UTF-8

The body of Unix programs is far too large to start rewriting everything. But the 16-bit Unicode values do not fit well into the traditional Unix scheme with strings terminated by a NUL-byte. Indeed, in Unicode 'H' is coded U+0048, and at first sight that consists of the two bytes 00 and 48 (hex). To solve this problem the UTF-8 scheme of coding Unicode was developed. The convention is this:

 0xxxxxxx
 110xxxxx 10xxxxxx
 1110xxxx 10xxxxxx 10xxxxxx
 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x-es are the bits to be coded, and the code is shown. Values in the range 0-127 are coded by a single byte. Values below 2¹¹ are coded by two bytes. Values below 2¹⁶ are coded by three bytes. Etc. Thus, hex ABCD, that is, binary 1010101111001101 is coded by the three bytes 11101010 10101111 10001101.

The advantage of this code is that the only way bytes in the range 0-127 occur is as themselves. Never as part of a multi-byte code. Very convenient for Americans: the conversion from ASCII to Unicode coded in UTF-8 is trivial: do nothing. This code is called FileSystem-Safe because NULs and / and . characters, the only bytes the kernel looks at when handling filenames, do not occur in unexpected ways.

In Western Europe one has a few accented letters and these now require 2 bytes. That is not serious. In China and Japan everything that used to take 2 bytes now takes 3 bytes, and people are unhappy. So, UTF-16, the 16-bit encoding of Unicode is more popular there. (Apart from the fact that some people in Japan are unhappy because of the unification.)

Java uses Unicode internally and makes locale-dependent filenames.

 % cat > pi.java
 import java.io.*;
 
 class Main {
         public static void main(String[] args) {
                 String s = "\u03c0";    // pi
                 try {
                         FileWriter out = new FileWriter(s);
                         out.write(s);
                         out.flush();
                         out.close();
                 } catch (IOException ee) {
                         System.out.println("Error writing file "+s+": "+ee);
                 }
         }
 }
 % javac pi.java
 % java Main
 % ls
 ?  Main.class  pi.java
 % LC_ALL=nl_NL.utf8
 % export LC_ALL
 % java Main
 % ls
 ?  Main.class  pi.java  �

Here the letter pi was converted into a question mark in the C locale, and into cf 80 (hex) in an utf8 locale. Note that cf 80 is binary 11001111 10000000 and hence codes for 01111000000, that is U+03c0. On a uxterm the filename is shown with a nice Greek pi.

Many Microsoft filesystems encode information about the character set the filenames are in. Unix does not do that, and does not really view filenames as sequences of characters - it views filenames as sequences of bytes.

6.2 The Unix filesystem model

A Unix filesystem has inodes as basic entities. An inode (index node) contains information about the type, owner, size, permissions and timestamps of the file, and points to where on disk this file lives. The inode has room for a short list of block numbers, so if the file is small this suffices, but otherwise indirect blocks are used, that themselves contain a list of block numbers. If this still does not suffice doubly and triply indirect blocks are used.

A file is nothing but an inode. In particular, a file is not characterized by a name. There are special files, called directories, that contain names and the numbers of the corresponding inodes. From a name one can find, by a lookup process, which inode number belongs to that name. From an inode one cannot find a name. There may be zero, one, two, or more names for the same file.

The filesystem keeps track of the number of names a file has, and the system keeps track of the processes that currently have an open file descriptor for the file. When the last name is removed from the filesystem, and all file descriptors for the file have been closed, the blocks that the file occupied can be put onto the free list.

Links

Names of a file are also called links to that file. Or hard links if it is necessary to distinguish them from soft links. All hard links are equivalent.

A soft link, or symbolic link, symlink in short, is a very different animal. It is a very small file that contains the name of another file, and has the property that the system name lookup process will usually automatically follow this redirection. All hard links to a file live in the same filesystem. Symlinks to the file can live in different filesystems. The filesystem keeps track of the number of hard links to a file, but not of the number of symlinks. If you remove the file a symlink pointed to, the symlink becomes a dangling symlink.

Since the Unix file hierarchy should be a tree, there should be precisely one hard link to a directory. Symlinks however can point at arbitrary path names. A file tree walker should be careful not to follow symlinks (or otherwise keep track of all files visited).

Device special files

The Unix philosophy is: "everything is a file". That makes life easy, the same system calls are used to read from any device. Special device nodes are inodes that refer to a device rather than to a file. They come in two kinds: block special devices are block-structured, allow random access, and, in case they contain a filesystem, can be mounted. The typical example is a disk. All (other) devices are character special devices. One has tapes, scanners, modems, etc. Also block special devices can be often accessed via a character special device node.

Device special files are very simple: the inode contains a pair of small integers (the major and minor device numbers), and these numbers are used by the system as index in some table to find the driver for the device.

Other file types

Depending on the Unix version there may be other types of files other than regular files, directories, and device special files. Most versions know about fifos, symlinks and sockets. Try man 2 stat on a Linux system for a list of types.

Sparse files

A Unix filesystem may have sparse files: files with holes. The holes can be read, and then read as all zeroes. They do not take (much) space: the actual data blocks are not written. Sometimes this is important when space is scarce, like on a rescue floppy.

A demo (error handling removed for brevity):

 /* mkhole.c */
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/stat.h>
 
 main(int argc, char **argv){
         char *filename = "AB";
         int fd = open(filename, O_RDWR|O_CREAT|O_TRUNC, 0666);
         int offset = (argc > 1) ? atoi(argv[1]) : 1000000;
         struct stat statbuf;
 
         write(fd, "A", 1);
         lseek(fd, offset, SEEK_SET);
         write(fd, "B", 1);
         close(fd);
         stat(filename, &statbuf);
         printf("The file %s has size %lld and takes %lld sectors\n", filename,
                (long long) statbuf.st_size, (long long) statbuf.st_blocks);
         return 0;
 }

This will create a file with one byte 'A', then a hole, then another byte 'B'. Look at the size of the resulting file (measured in 512-byte sectors):

 % ./mkhole 1
 The file AB has size 2 and takes 2 sectors
 % ./mkhole 1023
 The file AB has size 1024 and takes 2 sectors
 % ./mkhole 1024
 The file AB has size 1025 and takes 4 sectors
 % ./mkhole 12287
 The file AB has size 12288 and takes 4 sectors
 % ./mkhole 12288
 The file AB has size 12289 and takes 6 sectors
 % ./mkhole 274431
 The file AB has size 274432 and takes 6 sectors
 % ./mkhole 274432
 The file AB has size 274433 and takes 8 sectors
 % ./mkhole 67383295
 The file AB has size 67383296 and takes 8 sectors
 % ./mkhole 67383296
 The file AB has size 67383297 and takes 10 sectors
 % ./mkhole 2147483646
 The file AB has size 2147483647 and takes 10 sectors
 % ls -l AB
 -rw-r--r--    1 aeb      users    2147483647 Sep 28 21:51 AB
 % du AB
 5       AB
 % ./mkhole 2147483647
 File size limit exceeded

Apparently this filesystem uses 1024-byte blocks, and has a maximum file size 2147483647. The file of this maximum size, but with a giant hole, takes 5 blocks.

6.3 The Linux filesystem model

Most of what was said earlier about the Unix filesystem holds for the Linux situation. But there are two important differences. On the one hand, there is a new kind of objects, dentries. On the other hand the situation with one large file hierarchy is left. Every process may see a different hierarchy. Also, a single filesystem may be visible several places in the tree.

Dentries

The Unix name resolution process resolves a name into an inode, and no trace of the name remains. In particular, a working directory is known only as inode, and a command like pwd has to go to great trouble (stat the current directory, find device and inode number, then chdir(".."), stat all entries in that directory until an entry with the same device and inode number is found; if found we know the last component of the name; now repeat), and may not find the name, for example because of problems with permissions.

Linux does things in two stages. First the name is resolved into a dentry ("directory entry"). Then the dentry is resolved into an inode. Some names are needed very frequently, such as the files used by the C library. A dentry cache speeds things up. Having dentries allows Linux to do what Unix could not, namely implement a getcwd() system call.

Exercise Play with pwd and getcwd() in several situations.

(i) Explain the following. (This is with bash.)

 % pwd
 /home/aeb
 % ln -s /usr/bin ub
 % cd ub
 % echo $PWD
 /home/aeb/ub
 % pwd
 /home/aeb/ub
 % ls -ld ../bin
 drwxr-xr-x    3 root     root        44528 2002-10-27 18:56 ../bin
 % cd ..
 % ls -ld bin
 drwxr-xr-x    3 aeb      users        3072 2002-10-26 22:31 bin
 % pwd
 /home/aeb

How come that bin has different owner and size? Try the same after first doing

 % set -o physical
 %

(ii) pwd may give the wrong answer, or no answer at all. Try

 % mkdir aa
 % cd aa
 % mv ../aa ../bb
 % echo $PWD
 /home/aeb/aa
 % pwd
 /home/aeb/aa
 % /bin/pwd
 /home/aeb/bb

Try the same after first doing

 % set -o physical
 %

and see that pwd is silent.

(iii) The above shows that there is a layer of user space software between the kernel image of things and the user. For testing it may be less confusing to write a small C program that more directly (there is still the C library) reports what the kernel says.

 /* getcwd.c */
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 
 char buf[4096];
 
 int main(){
         char *cwd = getcwd(buf, sizeof(buf));
 
         if (!cwd) {
                 perror("getcwd");
                 exit(1);
         }
         printf("%s\n", cwd);
         return 0;
 }

Now do the same things using this utility getcwd.

 % mkdir aa
 % cd aa
 % getcwd
 /home/aeb/aa
 % mv ../aa ../bb
 % getcwd
 /home/aeb/bb
 % rmdir ../bb
 % getcwd
 getcwd: No such file or directory

This shows that the current working directory of a process can be deleted (when it is empty). The inode still exists as long as the process is there, but the name has gone away.

Mounts

Mounting a filesystem at a pathname means that that path now points to the root of that filesystem, instead of at the part of the tree it used to point at. This is Unix semantics. It means that a new inode will be associated with the path after the mount. In particular, the permissions of the directory (it usually is a directory one mounts on) may well change. It also means that part of the old tree may have become invisible to everybody, except for processes that had their working directory below the moint point.

This already shows that also classical Unix knows about the possibility of a file hierarchy that is a forest instead of a tree, more or less as an accident. Linux allows many things unknown in classical Unix. A "bind" mount attaches an already mounted filesystem somewhere else, so that a filesystem can be multiply mounted. (And in such a situation multiple umount calls will be required.)

Exercise Play with mount in several situations. You may have to be root.

 # mkdir tmp
 # mount --bind /tmp ./tmp

Now ./tmp and /tmp are different names that refer to the same subtree.

 # cd tmp
 # umount /home/aeb/tmp
 umount: /home/aeb/tmp: device is busy
 # cd /tmp
 # umount /home/aeb/tmp
 #

This shows that the current working directory is not just an inode. Although tmp and /home/aeb/tmp are two names for the same directory, using one keeps the mount busy, using the other doesn't.

 # cd /home/aeb/tmp
 # mount --bind /tmp /home/aeb/tmp
 # getcwd
 /home/aeb/tmp
 # ls
 .  ..
 # ls /home/aeb/tmp
 .  ..  .X0-lock  .X11-unix
 # umount /home/aeb/tmp

What happens here? The current working directory disappears from view since it is overmounted. This shows that the current working directory is not just a string: ls . and ls /home/aeb/tmp give different results because . and /home/aeb/tmp are different directories (hence have different inodes). One can check the inode number of a file using ls -id file.

 # mkdir /home/aeb/foo
 # mount --bind /tmp /home/aeb/tmp
 # mount --move /home/aeb/tmp /home/aeb/foo
 # ls /home/aeb/foo
 .  ..  .X0-lock  .X11-unix
 # umount /home/aeb/foo

This shows that it is possible to move mounted trees around.

Per-process namespaces

Since 2.4.19/2.5.2, the clone() system call, a generalization of Unix fork() and BSD vfork(), may have the CLONE_NEWNS flag, that says that all mount information must be copied. Afterwards, mount, chroot, pivotroot and similar namespace changing calls done by this new process do influence this process and its children, but not other processes. In particular, the virtual file /proc/mounts that lists the mounted filesystems, is now a symlink to /proc/self/mounts - different processes may live in entirely different file hierarchies.

Exercise Play with this in several situations. You may have to be root.

 /* newnamespace.c */
 #include <stdio.h>
 #include <unistd.h>
 #include <stdlib.h>
 #include <signal.h>
 #include <sched.h>
 
 #ifndef CLONE_NEWNS
 #define CLONE_NEWNS 0x00020000
 #endif
 
 int childfn(void *p) {
         setenv("PS1", "@@ ", 1);
         execl("/bin/ash", "ash", NULL);
         perror("execl");
         fprintf(stderr, "Cannot exec ash.\n");
         return 1;
 }
 
 int main(){
         char buf[10000];
         pid_t pid, p;
 
         pid = clone(childfn, buf+5000, CLONE_NEWNS | SIGCHLD, NULL);
         if ((int) pid == -1) {
                 perror("clone");
                 exit(1);
         }
 
         p = waitpid(pid, NULL, 0);
         if ((int) p == -1) {
                 perror("waitpid");
                 exit(1);
         }
 
         return 0;
 }

This program starts with a namespace that is a copy of the global one, and then forks off a shell (/bin/ash, a simple shell, easier to handle than /bin/sh) and waits for it to finish. If this is under X, you can give a few commands xterm & to ash so as to start some shells in the new namespace. Check that mounts and umounts done in the new namespace are invisible in the old namespace.

BUGS: The program mount does not know about this feature yet, so updates /etc/mtab. Reality is visible in /proc/mounts. Some kernel versions have a bug that would cause the new process to have a strange working directory. Probably that is avoided if this is started with a working directory / or so - not in some mounted filesystem.

6.4 Open files

In order to use a file (not just its name), one has to open it. The result of the open() system call is a small integer, called a file descriptor, index in a per-process table of open files (also called file descriptions) and subsequent read and write accesses will use this integer, often called fd.

A funny part of the socket API is that one can transmit file descriptions over a socket, using the ancillary data part of the sendmsg() and recvmsg() system calls. The sender opens a file, gets file descriptor 5 referring to the corresponding open file description, and sends the integer 5 using sendmsg(). The receiver already had open files numbered 0-8, and when he does recvmsg(), he gets the value 9 as file descriptor referring to the same open file.

7. Filesystems

Filesystems are containers of files, that are stored, probably in a directory tree, together with attributes, like size, owner, creation date and the like. A filesystem has a type. It defines how things are arranged on the disk. For example, one has the types minix, ext2, reiserfs, iso9660, vfat, hfs.

7.1 FAT

The traditional DOS filesystem types are FAT12 and FAT16. Here FAT stands for File Allocation Table: the disk is divided into clusters, the unit used by the file allocation, and the FAT describes which clusters are used by which files.

Let us describe the FAT filesystem in some detail. The FAT12/16 type is important, not only because of the traditional use, but also because it is useful for data exchange between different operating systems, and because it is the filesystem type used by all kinds of devices, like digital cameras.

Layout

First the boot sector (at relative address 0), and possibly other stuff. Together these are the Reserved Sectors. Usually the boot sector is the only reserved sector.

Then the FATs (following the reserved sectors; the number of reserved sectors is given in the boot sector, bytes 14-15; the length of a sector is found in the boot sector, bytes 11-12).

Then the Root Directory (following the FATs; the number of FATs is given in the boot sector, byte 16; each FAT has a number of sectors given in the boot sector, bytes 22-23).

Finally the Data Area (following the root directory; the number of root directory entries is given in the boot sector, bytes 17-18, and each directory entry takes 32 bytes; space is rounded up to entire sectors).

Boot sector

The first sector (512 bytes) of a FAT filesystem is the boot sector. In Unix-like terminology this would be called the superblock. It contains some general information.

First an explicit example (of the boot sector of a DRDOS boot floppy).

 0000000 eb 3f 90 49 42 4d 20 20 33 2e 33 00 02 01 01 00
 0000020 02 e0 00 40 0b f0 09 00 12 00 02 00 00 00 00 00
 0000040 00 00 00 00 00 00 00 00 00 00 70 00 ff ff 49 42
 0000060 4d 42 49 4f 20 20 43 4f 4d 00 50 00 00 08 00 18
 ...

(See here for the complete sector. And also a MSDOS example)

The 2-byte numbers are stored little endian (low order byte first).

 Bytes   Content
 0-2     Jump to bootstrap (E.g. eb 3c 90; on i86: JMP 003E NOP.
         One finds either eb xx 90, or e9 xx xx.
         The position of the bootstrap varies.)
 3-10    OEM name/version (E.g. "IBM  3.3", "MSDOS5.0", "MSWIN4.0".
         Various format utilities leave their own name, like "CH-FOR18".
         Sometimes just garbage. Microsoft recommends "MSWIN4.1".)
         /* BIOS Parameter Block starts here */
 11-12   Number of bytes per sector (512)
         Must be one of 512, 1024, 2048, 4096.
 13      Number of sectors per cluster (1)
         Must be one of 1, 2, 4, 8, 16, 32, 64, 128.
         A cluster should have at most 32768 bytes. In rare cases 65536 is OK.
 14-15   Number of reserved sectors (1)
         FAT12 and FAT16 use 1. FAT32 uses 32.
 16      Number of FAT copies (2)
 17-18   Number of root directory entries (224)
         0 for FAT32. 512 is recommended for FAT16.
 19-20   Total number of sectors in the filesystem (2880)
         (in case the partition is not FAT32 and smaller than 32 MB)
 21      Media descriptor type (f0: 1.4 MB floppy, f8: hard disk; see below)
 22-23   Number of sectors per FAT (9)
         0 for FAT32.
 24-25   Number of sectors per track (12)
 26-27   Number of heads (2, for a double-sided diskette)
 28-29   Number of hidden sectors (0)
         Hidden sectors are sectors preceding the partition.
         /* BIOS Parameter Block ends here */
 30-509  Bootstrap
 510-511 Signature 55 aa

The signature is found at offset 510-511. This will be the end of the sector only in case the sector size is 512.

The ancient media descriptor type codes are:

 For 8" floppies:
 fc, fd, fe - Various interesting formats
 
 For 5.25" floppies:
 Value  DOS version  Capacity  sides  tracks  sectors/track
 ff     1.1           320 KB    2      40      8
 fe     1.0           160 KB    1      40      8
 fd     2.0           360 KB    2      40      9
 fc     2.0           180 KB    1      40      9
 fb                   640 KB    2      80      8
 fa                   320 KB    1      80      8
 f9     3.0          1200 KB    2      80     15
 
 For 3.5" floppies:
 Value  DOS version  Capacity  sides  tracks  sectors/track
 fb                   640 KB    2      80      8
 fa                   320 KB    1      80      8
 f9     3.2           720 KB    2      80      9
 f0     3.3          1440 KB    2      80     18
 f0                  2880 KB    2      80     36
 
 For RAMdisks:
 fa
 
 For hard disks:
 Value  DOS version
 f8     2.0

This code is also found in the first byte of the FAT.

IBM defined the media descriptor byte as 11111red, where r is removable, e is eight sectors/track, d is double sided.

FAT16

FAT16 uses the above BIOS Parameter Block, with some extensions:

 11-27   (as before)
 28-31   Number of hidden sectors (0)
 32-35   Total number of sectors in the filesystem
         (in case the total was not given in bytes 19-20)
 36      Logical Drive Number (for use with INT 13, e.g. 0 or 0x80)
 37      Reserved (Earlier: Current Head, the track containing the Boot Record)
         Used by Windows NT: bit 0: need disk check; bit 1: need surface scan
 38      Extended signature (0x29)
         Indicates that the three following fields are present.
 39-42   Serial number of partition
 43-53   Volume label or "NO NAME    "
 54-61   Filesystem type (E.g. "FAT12   ", "FAT16   ", "FAT     ", or all zero.)
 62-509  Bootstrap
 510-511 Signature 55 aa

FAT32

FAT32 uses an extended BIOS Parameter Block:

 11-27   (as before)     
 28-31   Number of hidden sectors (0)
 32-35   Total number of sectors in the filesystem
 36-39   Sectors per FAT
 40-41   Mirror flags
         Bits 0-3: number of active FAT (if bit 7 is 1)
         Bits 4-6: reserved
         Bit 7: one: single active FAT; zero: all FATs are updated at runtime
         Bits 8-15: reserved
 42-43   Filesystem version
 44-47   First cluster of root directory (usually 2)
 48-49   Filesystem information sector number in FAT32 reserved area (usually 1)
 50-51   Backup boot sector location or 0xffff if none (usually 6)
 52-63   Reserved
 64      Logical Drive Number (for use with INT 13, e.g. 0 or 0x80)
 65      Reserved (used by Windows NT)
 66      Extended signature (0x29)
         Indicates that the three following fields are present.
 67-70   Serial number of partition
 71-81   Volume label
 82-89   Filesystem type ("FAT32   ")

The old 2-byte fields "total number of sectors" and "number of sectors per FAT" are now zero; this information is now found in the new 4-byte fields.

An important improvement is the "First cluster of root directory" field. Earlier, the root directory was not part of the Data Area, and was in a known place with a known size, and hence was unable to grow. Now the root directory is just somewhere in the Data Area.

FAT

The disk is divided into clusters. The number of sectors per cluster is given in the boot sector byte 13.

The File Allocation Table has one entry per cluster. This entry uses 12, 16 or 28 bits for FAT12, FAT16 and FAT32.

The first two FAT entries

The first cluster of the data area is cluster #2. That leaves the first two entries of the FAT unused. In the first byte of the first entry a copy of the media descriptor is stored. The remaining bits of this entry are 1. In the second entry the end-of-file marker is stored. The high order two bits of the second entry are sometimes, in the case of FAT16 and FAT32, used for dirty volume management: high order bit 1: last shutdown was clean; next highest bit 1: during the previous mount no disk I/O errors were detected.

(Historically this description has things backwards: DOS 1.0 did not have a BIOS Parameter Block, and the distinction between single-sided and double-sided 5.25" 360K floppies was indicated by the first byte in the FAT. DOS 2.0 introduced the BPB with media descriptor byte.)

FAT12

Since 12 bits is not an integral number of bytes, we have to specify how these are arranged. Two FAT12 entries are stored into three bytes; if these bytes are uv,wx,yz then the entries are xuv and yzw.

Possible values for a FAT12 entry are: 000: free, 002-fef: cluster in use; the value given is the number of the next cluster in the file, ff0-ff6: reserved, ff7: bad cluster, ff8-fff: cluster in use, the last one in this file. Since the first cluster in the data area is numbered 2, the value 001 does not occur.

DOS 1.0 and 2.0 used FAT12. The maximum possible size of a FAT12 filesystem (volume) was 8 MB (4086 clusters of at most 4 sectors each).

FAT16

DOS 3.0 introduced FAT16. Everything very much like FAT12, only the FAT entries are now 16 bit. Possible values for FAT16 are: 0000: free, 0002-ffef: cluster in use; the value given is the number of the next cluster in the file, fff0-fff6: reserved, fff7: bad cluster, fff8-ffff: cluster in use, the last one in this file.

Now the maximum volume size was 32 MB, mostly since DOS 3.0 used 16-bit sector numbers. This was fixed in DOS 4.0 that uses 32-bit sector numbers. Now the maximum possible size of a FAT16 volume is 2 GB (65526 clusters of at most 64 sectors each).

FAT32

FAT32 was introduced in Windows 95 OSR 2. Everything very much like FAT16, only the FAT entries are now 32 bits of which the top 4 bits are reserved. The bottom 28 bits have meanings similar to those for FAT12, FAT16. For FAT32: Cluster size used: 4096-32768 bytes.

Microsoft operating systems use the following rule to distinguish between FAT12, FAT16 and FAT32. First, compute the number of clusters in the data area (by taking the total number of sectors, subtracting the space for reserved sectors, FATs and root directory, and dividing, rounding down, by the number of sectors in a cluster). If the result is less that 4085 we have FAT12. Otherwise, if it is less than 65525 we have FAT16. Otherwise FAT32.

Microsoft operating systems use fff, ffff, xfffffff as end-of-clusterchain markers, but various common utilities may use different values.

Directory Entry

An example (6 entries on the same MSDOS floppy):

 0009728 49 4f 20 20 20 20 20 20 53 59 53 27 00 00 00 00  IO      .SYS
 0009744 00 00 00 00 00 00 08 5d 62 1b 1d 00 16 9f 00 00
 0009760 4d 53 44 4f 53 20 20 20 53 59 53 27 00 00 00 00  MSDOS   .SYS
 0009776 00 00 00 00 00 00 08 5d 62 1b 6d 00 38 95 00 00
 0009792 43 4f 4d 4d 41 4e 44 20 43 4f 4d 20 00 00 00 00  COMMAND .COM
 0009808 00 00 00 00 00 00 07 5d 62 1b b8 00 39 dd 00 00
 0009824 44 42 4c 53 50 41 43 45 42 49 4e 27 00 00 00 00  DBLSPACE.BIN
 0009840 00 00 00 00 00 00 08 5d 62 1b 27 01 f6 fc 00 00
 0009856 4d 53 44 4f 53 20 20 20 20 20 20 28 00 00 00 00  MSDOS
 0009872 00 00 00 00 00 00 1a 88 99 1c 00 00 00 00 00 00
 0009888 46 44 49 53 4b 20 20 20 45 58 45 20 00 00 00 00  FDISK   .EXE
 0009904 00 00 00 00 00 00 36 59 62 1b 02 00 17 73 00 00

 Bytes   Content
 0-10    File name (8 bytes) with extension (3 bytes)
 11      Attribute - a bitvector. Bit 0: read only. Bit 1: hidden.
         Bit 2: system file. Bit 3: volume label. Bit 4: subdirectory.
         Bit 5: archive. Bits 6-7: unused.
 12-21   Reserved (see below)
 22-23   Time (5/6/5 bits, for hour/minutes/doubleseconds)
 24-25   Date (7/4/5 bits, for year-since-1980/month/day)
 26-27   Starting cluster (0 for an empty file)
 28-31   Filesize in bytes

We see that the fifth entry in the example above is the volume label, while the other entries are actual files.

The "archive" bit is set when the file is created or modified. It is cleared by backup utilities. This allows one to do incremental backups.

As a special kludge to allow undeleting files that were deleted by mistake, the DEL command will replace the first byte of the name by 0xe5 to signify "deleted". As an extraspecial kludge, the first byte 0x05 in a directory entry means that the real name starts with 0xe5.

The first byte of a name must not be 0x20 (space). Short names or extensions are padded with spaces. Special ASCII characters 0x22 ("), 0x2a (*), 0x2b (+), 0x2c (,), 0x2e (.), 0x2f (/), 0x3a (:), 0x3b (;), 0x3c (<), 0x3d (=), 0x3e (>), 0x3f (?), 0x5b ([), 0x5c (\), 0x5d (]), 0x7c (|) are not allowed.

The first byte 0 in a directory entry means that the directory ends here. (Now the Microsoft standard says that all following directory entries should also have first byte 0, but the Psion's OS, EPOC, works with a single terminating 0.)

Subdirectories start with entries for . and .., but the root directory does not have those.

VFAT

In Windows 95 a variation was introduced: VFAT. VFAT (Virtual FAT) is FAT together with long filenames (LFN), that can be up to 255 bytes long. The implementation is an ugly hack. These long filenames are stored in special directory entries. A special entry looks like this:

 Bytes   Content
 0       Bits 0-4: sequence number; bit 6: final part of name
 1-10    Unicode characters 1-5
 11      Attribute: 0xf
 12      Type: 0
 13      Checksum of short name
 14-25   Unicode characters 6-11
 26-27   Starting cluster: 0
 28-31   Unicode characters 12-13

These special entries should not confuse old programs, since they get the 0xf (read only / hidden / system / volume label) attribute combination that should make sure that all old programs will ignore them.

The long name is stored in the special entries, starting with the tail end. The Unicode characters are of course stored little endian. The sequence numbers are ascending, starting with 1.

Now an ordinary directory entry follows these special entries, and it has the usual info (file size, starting cluster, etc.), and a short version of the long name. Also the unused space in directory entries is used now: bytes 13-17: creation date and time (byte 13: centiseconds 0-199, bytes 14-15: time as above, bytes 16-17: date as above), bytes 18-19: date of last access. (And byte 12 is reserved for Windows NT - it indicates whether the filename is in upper or in lower case; byte 20 is reserved for OS/2.)

Old programs might change directories in ways that can separate the special entries from the ordinary one. To guard against that the special entries have in byte 13 a checksum of the short name:

 int i; unsigned char sum = 0;
 
     for (i = 0; i < 11; i++) {
         sum = (sum >> 1) + ((sum & 1) << 7);  /* rotate */
         sum += name[i];                       /* add next name byte */
     }

An additional check is given by the sequence number field. It numbers the special entries belonging to a single LFN 1, 2, ... where the last entry has bit 6 set.

The short name is derived from the long name as follows: The extension is the extension of the long name, truncated to length at most three. The first six bytes of the short name equal the first six nonspace bytes of the long name, but bytes + , ; = [ ], that are not allowed under DOS, are replaced by underscore. Lower case is converted to upper case. The final two (or more, up to seven, if necessary) bytes become ~1, or, if that exists already, ~2, etc., up to ~999999.

VFAT is used in the same way on each of FAT12, FAT16, FAT32.

FSInfo sector

FAT32 stores extra information in the FSInfo sector, usually sector 1.

 Bytes   Content
 0-3     0x41615252 - the FSInfo signature
 4-483   Reserved
 484-487 0x61417272 - a second FSInfo signature
 488-491 Free cluster count or 0xffffffff (may be incorrect)
 492-495 Next free cluster or 0xffffffff (hint only)
 496-507 Reserved
 508-511 0xaa550000 - sector signature

Variations

One meets slight variations on the FAT filesystem in many places. Here a description of the version used for the Xbox.

7.2 Ext2

The ext2 filesystem was developed by R�y Card and added to Linux in version 0.99pl7 (March 1993). It was a greatly improved version of his earlier ext filesystem (that again was a small improvement on the minix filesystem), and uses ideas from the Berkeley Fast Filesystem. It is really fast and robust. The main reason people want something else nowadays is that on the modern very large disks an invocation of e2fsck (to check filesystem integrity after a crash or power failure, or just after some predetermined number of boots) takes a long time, like one hour.

Layout

First, space is reserved for a boot block (1024 bytes). This is not part of the filesystem proper, and ext2 has no opinion about what should be there. This boot block is followed by a number of ext2 block groups.

Each block group starts with a copy of the superblock, then a copy of the group descriptors (for the entire filesystem - ach), then a block bitmap, then an inode bitmap, then an inode table, then data blocks.

An attempt is made to have the data blocks of a file in the same block group as its inode, and as much as possible consecutively, thus minimizing the amount of seeking the disk has to do.

Having a copy of superblock and group descriptors in each block group seemed reasonable when a filesystem had only a few block groups. Later, with thousands of block groups, it became clear that this redundancy was ridiculous and meaningless. (And for a sufficiently large filesystem the group descriptors alone would fill everything, leaving no room for data blocks.) So, later versions of ext2 use a sparse distribution of superblock and group descriptor copies.

The structure of ext2 can be read in the kernel source. The data structures are defined in ext2_fs.h, ext2_fs_i.h, ext2_fs_sb.h (in include/linux) and the code is in fs/ext2.

Exercise Do cd /tmp; dd if=/dev/zero of=e2fs bs=1024 count=10000; mke2fs -F e2fs; od -Ax -tx4 e2fs. (This creates an empty ext2 filesystem in the file /tmp/e2fs, and prints its contents.) Compare the od output with the description below.

The superblock

First, the superblock (of the original ext2, later more fields were added).

 struct ext2_super_block {
         unsigned long s_inodes_count;     /* Inodes count */
         unsigned long s_blocks_count;     /* Blocks count */
         unsigned long s_r_blocks_count;   /* Reserved blocks count */
         unsigned long s_free_blocks_count;/* Free blocks count */
         unsigned long s_free_inodes_count;/* Free inodes count */
         unsigned long s_first_data_block; /* First Data Block */
         unsigned long s_log_block_size;   /* log(Block size) - 10 */
         long s_log_frag_size;             /* Fragment size */
         unsigned long s_blocks_per_group; /* # Blocks per group */
         unsigned long s_frags_per_group;  /* # Fragments per group */
         unsigned long s_inodes_per_group; /* # Inodes per group */
         unsigned long s_mtime;          /* Mount time */
         unsigned long s_wtime;          /* Write time */
         unsigned long s_pad;            /* Padding to get the magic signature*/
                                         /* at the same offset as in the */
                                         /* previous ext fs */
         unsigned short s_magic;         /* Magic signature */
         unsigned short s_valid;         /* Flag */
         unsigned long s_reserved[243];  /* Padding to the end of the block */
 };

The superblock contains information that is global to the entire filesystem, such as the total number of blocks, and the time of the last mount.

Some parameters, such as the number of blocks reserved for the superuser, can be tuned using the tune2fs utility. Having reserved blocks makes the system more stable. It means that if a user program gets out of control and fills the entire disk, the next boot will not fail because of lack of space when utilities at boot time want to write to disk.

The magic signature is 0xef53. The signature is too small: with random data one in every 2^16 blocks will have this value at this offset, so that several hundred blocks will have the ext2 superblock signature without being a superblock. Modern filesystems use a 32-bit or 64-bit signature.

The group descriptors

Then, the group descriptors. Each block group has a group descriptor, and all group descriptors for all block groups are repeated in each group. All copies except the first (in block group 0) of superblock and group descriptors are never used or updated. They are just there to help recovering from a crash or power failure.

The size of a block group depends on the chosen block size for the ext2 filesystem. Allowed block sizes are 1024, 2048 and 4096. The blocks of a block group are represented by bits in a bitmap that fills one block, so with 1024-byte blocks a block group spans 8192 blocks (8 MiB), while with 4096-byte blocks a block group spans 32768 blocks (128 MiB). When disks were small, small blocks were used to minimize the loss in space due to rounding up filesizes to entire blocks. These days the default is to use larger blocks, for faster I/O.

Question What percentage of the disk space is used (wasted) by copies of superblock and group descriptors on a 128 GiB filesystem with 1024-byte blocks? How large is the filesystem when all available space is taken by the group descriptors?

 struct ext2_group_desc
 {
         unsigned long bg_block_bitmap;          /* Blocks bitmap block */
         unsigned long bg_inode_bitmap;          /* Inodes bitmap block */
         unsigned long bg_inode_table;           /* Inodes table block */
         unsigned short bg_free_blocks_count;    /* Free blocks count */
         unsigned short bg_free_inodes_count;    /* Free inodes count */
         unsigned short bg_used_dirs_count;      /* Directories count */
         unsigned short bg_pad;
         unsigned long bg_reserved[3];
 };

Thus, a group descriptor takes 32 bytes, 18 of which are used. The field bg_block_bitmap gives the block number of the block allocation bitmap block. In that block the free blocks in the block group are indicated by 1 bits. Similarly, the field bg_inode_bitmap gives the block number of the inode allocation bitmap. The field bg_inode_table gives the starting block number of the inode table for this block group. These three fields are potentially useful at recovery time. Unfortunately, there is almost no redundancy in a bitmap, so if either a block with group descriptors or a block with a bitmap is damaged, e2fsck will happily go on and destroy the entire filesystem.

Project Investigate how the redundancy present in an ext2 filesystem could be used. Is it possible to detect that a block bitmap or inode bitmap is damaged? Presently, the redundancy present in the repeated superblocks and group descriptors is used only when the user explicitly invokes e2fsck with parameter -b N, where N is the block number of the superblock copy. (Now superblock N and group descriptor blocks N+1.. are used.) How can e2fsck detect automatically that something is wrong, and select and switch to a copy that is better?

The inode table

Each inode takes 128 bytes:

 struct ext2_inode {
         unsigned short i_mode;          /* File mode */
         unsigned short i_uid;           /* Owner Uid */
         unsigned long i_size;           /* Size in bytes */
         unsigned long i_atime;          /* Access time */
         unsigned long i_ctime;          /* Creation time */
         unsigned long i_mtime;          /* Modification time */
         unsigned long i_dtime;          /* Deletion Time */
         unsigned short i_gid;           /* Group Id */
         unsigned short i_links_count;   /* Links count, max 32000 */
         unsigned long i_blocks;         /* Blocks count */
         unsigned long i_flags;          /* File flags */
         unsigned long i_reserved1;
         unsigned long i_block[15];      /* Pointers to blocks */
         unsigned long i_version;        /* File version (for NFS) */
         unsigned long i_file_acl;       /* File ACL */
         unsigned long i_dir_acl;        /* Directory ACL */
         unsigned long i_faddr;          /* Fragment address */
         unsigned char i_frag;           /* Fragment number */
         unsigned char i_fsize;          /* Fragment size */
         unsigned short i_pad1;
         unsigned long i_reserved2[2];
 };

History

When 16-bit uid's no longer sufficed (a PC, and more than one user? yes, university machines handling the mail for more than 65000 people), the part

         unsigned long i_reserved2[2];

was replaced by

         __u16   l_i_uid_high;           /* High order part of uid */
         __u16   l_i_gid_high;           /* High order part of gid */
         __u32   l_i_reserved2;

Also, i_version was renamed into i_generation. This is for use with NFS. If a file is deleted, and later the inode is reused, then a client on another machine, that still had an open filehandle for the old file, must get an error return upon access. This is achieved by changing i_generation. The i_generation field of a file can be read and set using the EXT2_IOC_GETVERSION and EXT2_IOC_SETVERSION ioctls.

Exercise Write the tiny program that can read and change the ext2 version field of an inode. Get the ioctl definitions from <linux/ext2_fs.h>.

File sizes

Look at a filesystem with block size B (e.g., 1024 or 4096). The inode contains 12 pointers to direct blocks: the block numbers of the first 12 data blocks. Files of size not more than 12*B bytes (e.g., 12288 or 49152), do not need more. The 13th element of the array i_block is a pointer to an indirect block. That indirect block contains the block numbers of B/4 data blocks. That suffices for files of size not more than (B/4)*B + 12*B bytes (e.g., 274432 or 4243456). The 14th element of the array i_block is a pointer to a doubly indirect block. It contains the block numbers of B/4 indirect blocks. That suffices for size not more than (B/4)*(B/4)*B + (B/4)*B + 12*B bytes (e.g., 67383296 or 4299210752). The 15th and last element of the array i_block is a pointer to a triply indirect block. It contains the block numbers of B/4 doubly indirect blocks. That suffices up to (B/4)*(B/4)*(B/4)*B + (B/4)*(B/4)*B + (B/4)*B + 12*B bytes (e.g., 17247252480 or 4402345721856). Thus, this design allows for files not larger than about 4 TB. Other conditions may impose a lower limit on the maximum file size.

Sparse files are represented by having block numbers 0 represent holes.

Exercise Explain the sizes given earlier for sparse files in some ext2 filesystem.

Ext2 has fast symbolic links: if the file is a symlink (which is seen from its i_mode field), and the length of the pathname contained in the symlink is less than 60 bytes, then the actual file contents is stored in the i_block[] array, and the fact that this happened is visible from the fact that i_blocks is zero.

As an aside: how large are files in reality? That is of great interest to a filesystem designer. Things of course depend strongly on the type of use, but let me make the statistics on one of my machines.

 Size of ordinary files:
 ("m bits" means "size less than 2^m but not less than 2^{m-1} unless m==0")
 0 bits: 27635
 1 bits: 207
 2 bits: 712
 3 bits: 2839
 4 bits: 12343
 5 bits: 66063
 6 bits: 47328
 7 bits: 45039
 8 bits: 71593
 9 bits: 104873
 10 bits: 171541
 11 bits: 356011
 12 bits: 517599
 13 bits: 283794
 14 bits: 191133
 15 bits: 132640
 16 bits: 70352
 17 bits: 38069
 18 bits: 16614
 19 bits: 8182
 20 bits: 6045
 21 bits: 3023
 22 bits: 1433
 23 bits: 1020
 24 bits: 444
 25 bits: 250
 26 bits: 48
 27 bits: 14
 28 bits: 12
 29 bits: 11
 30 bits: 7
 31 bits: 1

I see here 27635 empty files. The most common size is 12 bits: 2048-4095 bytes. Clearly, in this filesystem the majority of the files only need direct blocks. The fact that many files are small also means that a lot of space is wasted by going to a larger block size.

Other people will have a different distribution. Google developed their own filesystem GFS, a large scale distributed fault-tolerant file system, designed for an environment where one has "a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case".

Reserved inodes

Inode 1 is reserved for a list of bad blocks on the device.

The root directory has inode number 2. This number must be fixed in advance, since the mount() system call must be able to find the root directory.

A few other inode numbers have a special meaning. Inode 11 is the first one for ordinary use.

Directories

Files have pathnames, paths from the root / of the tree to the file. The last element of the path is the file name. Directories are files that give the correspondence between file names and inodes. For the filesystem examined above:

 Max level: 18
 118355 directories
 2176875 regular files
 24407 other

A directory entry

 struct ext2_dir_entry {
         unsigned long inode;            /* Inode number */
         unsigned short rec_len;         /* Directory entry length */
         unsigned short name_len;        /* Name length */
         char name[up to 255];           /* File name */
 };

does not have a fixed length. It gives the inode number and the name. There are two lengths: the length of the name, and after how many bytes the next directory entry starts. These may differ - on the one hand because entries are aligned to start at an address that is a multiple of 4. On the other hand, the presence of a rec_len field allows efficient file deletion: it is not necessary to shift all names in a directory when a file is deleted; instead only the rec_len of the preceding entry is adapted, so that a directory search automatically skips over deleted entries.

Question And what if the entry was the first in the directory? Read the answer in fs/ext2/namei.c.

While names in a symlink are NUL-terminated, names in a directory entry are not.

These days the directory struct is slightly different:

 struct ext2_dir_entry_2 {
         __u32   inode;                  /* Inode number; 0: unused */
         __u16   rec_len;                /* Directory entry length */
         __u8    name_len;               /* Name length */
         __u8    file_type;
         char    name[up to 255];        /* File name */
 };

Since name_len cannot be larger than 255, a single byte suffices, and the other byte, that used to be zero, is now a file type: unknown (0), regular (1), directory (2), character special device (3), block special device (4), fifo (5), socket (6), symlink (7). Some applications are sped up considerably by the presence of this file type, since they now can walk a tree without doing a stat() call on every file encountered.

Limits and efficiency

The macro EXT2_LINK_MAX, defined in ext2_fs.h to the value 32000, makes sure that a directory cannot contain more than 32000 subdirectories. That sometimes causes problems. (I ran into this when trying to extract the about 60000 different files from a collection of about 150000 similar files. Comparing pairwise would take too long, so I decided to move each file to a directory that had its md5sum as name. Now files in the same directory would be almost certainly identical, and a few diffs should remove the duplicates. The script failed after creating 31998 subdirectories.)

Something else is that ext2 does linear searches in its directories, and things get really slow with large directories. Timing on some machine here: create 10000 files in an empty directory: 98 sec; create 10000 files more: 142 sec; create 10000 files more: 191 sec; create 10000 files more: 242 sec. Clear quadratic behaviour. This means that it is a bad idea to store many files in one directory on an ext2 system. Try the same on a reiserfs system: 80 sec, 81 sec, 81 sec, 80 sec. No increase in running time.

Fragments

Fragments have not been implemented.

Attributes

Files on an ext2 filesystem can have various attributes. There are

A	Atime	Linux 2.0	`EXT2_NOATIME_FL`	Don't update the atime field upon access.
S	Sync	Linux 1.0	`EXT2_SYNC_FL`	Perform synchronous writes - do not buffer.
D	Dirsync	Linux 2.6	`EXT2_DIRSYNC_FL`	If this is a directory: sync.
a	Append Only	Linux 1.2	`EXT2_APPEND_FL`	Only allow opening of this file for appending.For directories: disallow file deletion.
i	Immutable	Linux 1.2	`EXT2_IMMUTABLE_FL`	Disallow all changes to this file (data and inode).
j	Journalling	none	`EXT2_JOURNAL_DATA_FL`	(ext3 only) Upon a write, first write to the journal.
d	No Dump	Linux 1.2	`EXT2_NODUMP_FL`	The dump(8) program should ignore this file.
c	Compress	none	`EXT2_COMPR_FL`	Transparently compress this file upon write, uncompress upon read.
s	Secure Deletion	Linux 1.0-1.2 only	`EXT2_SECRM`	When this file is deleted, zero its data blocks.
u	Undelete	none	`EXT2_UNRM_FL`	When this file is deleted, preserve its data blocks,so that later undeletion is possible.
T	Topdir	Linux 2.6	`EXT2_TOPDIR_FL`	Consider this directory a top directory for the Orlov allocator.

Attributes can be viewed with lsattr and changed using chattr. 'A' makes the system a bit faster since it saves some disk I/O. 'a' and 'i' are useful as a defense against hackers, even when they got root. (These bits are read and set using the EXT2_IOC_GETFLAGS and EXT2_IOC_SETFLAGS ioctls.)

Before Linux 2.1 there was a securelevel variable, so that the 'a' and 'i' bits could not be changed by root when it was larger than zero. Moreover, this variable could never be decreased, other than by rebooting. Or at least, that was the idea. However, with memory access one can do anything:

 # cat /proc/sys/kernel/securelevel
 1
 # cat /proc/ksyms | grep securelevel
 001a8f64 securelevel
 # echo "ibase=16; 001A8F64" | bc
 1740644
 # dd if=/dev/zero of=/dev/kmem seek=1740644 bs=1 count=1
 1+0 records in
 1+0 records out
 # cat /proc/sys/kernel/securelevel
 0
 #

Nowadays there is the capability system, see lcap(8), if you have that installed. Using

 # lcap CAP_LINUX_IMMUTABLE
 # lcap CAP_SYS_RAWIO

one disallows root to change the 'a' and 'i' bits, and disallows root to write to the raw disk or raw memory. Without this last restriction, root can do anything, since it can patch the running kernel code.

The above mostly describes an early version of ext2, and there are many details that were skipped. But this should suffice for our purposes.

7.3 Journaling filesystems

A crash caused by power failure or hardware failure or software bug may leave the filesystem in an inconsistent state. The traditional solution is the use of the utilities icheck, dcheck, ncheck, clri. However, with several hundred thousand files and a several GB filesystem checking the filesystem for consistency may take a long time, more than one is prepared to wait. A journaling filesystem has a journal (or log) that records all transactions before they are actually done on the filesystem. After a crash one finds in the journal what data was being modified at the moment of the crash, and bringing the filesystem in a consistent state is now very fast.

(If the crash happens during a write to the journal, we notice and need not do anything: the filesystem is OK and the journal can be erased. If the crash happens during a write to the filesystem, we replay the transactions listed in the journal.)

Of course there is a price: the amount of I/O is doubled. In cases where data integrity is less important but filesystem integrity is essential, one only journals metadata (inode and directory contents, not regular file contents).

There is a different price as well: the old check gave an absolute guarantee that after the, say, e2fsck the filesystem was clean. The check of a journaling filesystem only gives a conditional guarantee: if the power failure or hardware failure or software bug that caused the crash only affected the disk blocks currently being written to according to the journal, then all is well. Especially in the case of kernel bugs this assumption may be wrong.

And then a third price: code complexity.

Linux Journaling Filesystems

Currently, Linux supports four journaling filesystem types: ext3, jfs, reiserfs, xfs. Several other log-structured filesystems are under development.

(Ext3 is a journaling version of ext2, written by Stephen Tweedie. JFS is from IBM. Reiserfs is from Hans Reiser's Namesys. XFS is from SGI, and was ported from IRIX to Linux.) Each has its strengths and weaknesses, but reiserfs seems the most popular. It is the default filesystem type used by SuSE (who employ reiserfs developer Chris Mason). RedHat (employs Stephen Tweedie and) uses ext3.

Linux has a Journaling Block Device layer intended to handle the journaling for all journaling filesystem types. (See fs/jbd.) However, it is used by ext3 only.

7.4 NFS

NFS is a network filesystem. (It is the Network File System.) NFS v2 was released in 1985 by Sun. It allows you to mount filesystems of one computer on the file hierarchy of another computer. Other network filesystem types are smb (Samba, used for file-sharing with Windows machines), ncp (that provides access to Netware server volumes and print queues), Coda and the Andrew filesystem.

Setup

Let us first actually use it, and then look at the theory. On a modern distribution one just clicks "start NFS". Let us go the old-fashioned way and do the setup by hand.

Ingredients: two computers with ethernet connection. Make sure they see each other (ping works). Make sure both machines are running portmap (on each machine rpcinfo -p shows that portmap is running locally and rpcinfo -p othermachine shows that portmap is running remotely). If there are problems, check /etc/hosts.allow, /etc/hosts.deny and the firewall rules.

Is the intended server running NFS?

 # rpcinfo -p knuth
    program vers proto   port
     100000    2   tcp    111  portmapper
     100000    2   udp    111  portmapper
 #

Hmm. It isn't. Then let us first setup and start NFS on the server. We need to tell what is exported and to whom:

 knuth# cat /etc/exports
 / 192.168.1.12(ro)

This says: Export /, that is, everything on the filesystem rooted at /. Export it to 192.168.1.12, read-only.

Several daemons (especially tcpd) use the files /etc/hosts.allow and /etc/hosts.deny (first checking the first to see whether access is allowed, then checking the second to see whether access is denied, finally allowing access). Just to be sure, add

 knuth# cat /etc/hosts.allow
 # Allow local machines
 ALL:192.168.1.0/255.255.255.0
 
 knuth# cat /etc/hosts.deny
 # Deny everybody else
 ALL:ALL
 
 knuth#

Now start the NFS server:

 knuth# /usr/sbin/rpc.mountd
 knuth# /usr/sbin/rpc.nfsd

(Of course, once all is well one wants such commands in one of the startup scripts, but as said, we do things here by hand.) Check that all is well and mount the remote filesystem:

 # rpcinfo -p knuth
    program vers proto   port
     100000    2   tcp    111  portmapper
     100000    2   udp    111  portmapper
     100003    2   udp   2049  nfs
     100003    2   tcp   2049  nfs
     100005    1   udp    635  mountd
     100005    2   udp    635  mountd
     100005    1   tcp    635  mountd
     100005    2   tcp    635  mountd
 # mount -t nfs knuth:/ /knuth -o ro
 #

Done.

A portmapper may not be absolutely required - old versions of mount know that the port should be 635 - but in most setups both local and remote machine must have one.

There are other parts to NFS, e.g. the locking daemon.

What is exported is (part of) a filesystem on the server: the exported tree does not cross mount points. Thus, in the above example, if the server has the proc filesystem mounted at /proc, the client will see an empty directory there.

On some systems the file /etc/exports is not used directly. Instead, a file like /var/lib/nfs/xtab is used, and a utility exportfs(8) is used to initialize and maintain it. Now it is this utility that reads /etc/exports.

Theory

We shall mostly discuss NFS v2 ( RFC 1094).

The most important property of NFS v2/v3 is that it is stateless. The server does not keep state for the client. Each request is self-contained. After a timeout, the request is repeated. As a consequence, the system is robust in the face of server or client crashes. If the client crashes the server never notices. If the server crashes the client keeps repeating the request until the server is up and running again and answers the request.

(This is the default, also called "hard" mounting. "Soft" mounting is when the client gives up after a while and returns an error to the user. "Spongy" mounting is hard for data transfers and soft for status requests: stat, lookup, fsstat, readlink, and readdir.)

Thus, it is expected that retransmissions of requests may occur. And since the server is stateless, it probably has not remembered what requests were served earlier. It follows that one would like to have idempotent requests, where it does not matter whether we do them once or twice. Fortunately, most requests (read, write, chmod) are naturally idempotent. An example of a non-idempotent request is delete. A repetition would get an error return.

Due to the fact that NFS is stateless, it cannot quite reproduce Unix semantics.

For example, the old Unix trick: create a temporary file, and delete it immediately, retaining an open file descriptor, doesn't work. The open descriptor lives on the client, and the server doesn't know anything about it. Consequently the file would be removed, data and everything.

As a partial workaround, the client, who knows that the file is open, does not transmit the delete command to the server, but sends a "silly rename" command instead, renaming the file to some unlikely name that hopefully won't conflict with anything else, like .nfs7200205400000001, and waits until the file has been closed before actually removing this file. This helps, but is less than perfect. It does not protect against other clients that remove the file. If the client crashes, garbage is left on the server.

Somewhat similarly, protection is handled a bit differently. Under Unix, the protection is checked when the file is opened, and once a file has been opened successfully, a subsequent chmod has no influence. The NFS server does not have the concept of open file and checks permissions for each read or write.

As a partial workaround, the NFS server always allows the owner of a file read and write permission. (He could have given himself permission anyway.) Approximate Unix semantics is preserved by the client who can return "access denied" at open time.

A problem on larger networks is that read and write atomicity is lost. On a local machine "simultaneous" reads and writes to a file are serialized, and appear as if done in a well-defined order. Over the network, a read or write request can span many packets, and clients can see a partly old partly new situation.

Thus, locking is important. But of course locking cannot work in a stateless setup. NFS comes with lockd (and statd). It supplies lockf-style advisory locking.

NFS v3 ( RFC 1813) was released in 1993. It contains 64-bit extensions, and efficiency improvements. Only a small step from NFS v2.

NFS v4 ( RFC 3010) became a proposed standard in 2000. It is stateful, allows strong security, has file locking, uses UTF8. Rather different from earlier NFS.

The protocol

XDR

One part of the protocol is the data representation. If the communication channel is a byte stream, and the transmitted data contains 32-bit integers, then an agreement is required in which order one transmits the four bytes making up an integer. And similar questions occur for other data types. Sun defined the Extended Data Representation (XDR) that specifies the representation of the transmitted data. For example, 32-bit integers are transmitted with high-order byte first. All objects and fields of structures are NUL-padded to make their length a multiple of 4. Strings are represented by an integer giving the length followed by the bytes of the string. More generally, variable-sized data is preceded by an integer giving the length. For more details, see RFC 1014.

RPC

Communication is done via Remote Procedure Calls (RPCs). Usually these are sent via (unreliable) UDP, and retransmitted after a timeout. One can also use NFS over TCP.

RPC uses XDR. As a consequence, all small integers take four bytes. Some detail of the protocol is given below.

NFS

NFS uses RPC. All procedures are synchronous: when the (successful) reply is received, the client knows that the operation has been done. (This makes NFS slow: writes have to be committed to stable storage immediately. NFS v3 introduces caching and the COMMIT request to make things more efficient.)

The NFS requests:

0	NFSPROC_NULL	Do nothing
1	NFSPROC_GETATTR	Get File Attributes
2	NFSPROC_SETATTR	Set File Attributes
3	NFSPROC_ROOT	Get Filesystem Root
4	NFSPROC_LOOKUP	Look Up File Name
5	NFSPROC_READLINK	Read From Symbolic Link
6	NFSPROC_READ	Read From File
7	NFSPROC_WRITECACHE	Write to Cache
8	NFSPROC_WRITE	Write to File
9	NFSPROC_CREATE	Create File
10	NFSPROC_REMOVE	Remove File
11	NFSPROC_RENAME	Rename File
12	NFSPROC_LINK	Create Link to File
13	NFSPROC_SYMLINK	Create Symbolic Link
14	NFSPROC_MKDIR	Create Directory
15	NFSPROC_RMDIR	Remove Directory
16	NFSPROC_READDIR	Read From Directory
17	NFSPROC_STATFS	Get Filesystem Attributes

Packets on the wire

On the wire, we find ethernet packets. Unpacking these, we find IP packets inside. The IP packets contain UDP packets. The UDP packets contain RPC packets. It is the RPC (remote procedure call) mechanism that implements NFS.

Ethernet

Just for fun, let us look at an actual packet on the wire: 170 bytes.

 0000  00 10 a4 f1 3c d7 00 e0  4c 39 1b c2 08 00 45 00
 0010  00 9c 00 20 40 00 40 11  b6 cc c0 a8 01 0c c0 a8
 0020  01 08 03 20 08 01 00 88  d3 3d e2 c0 87 0b 00 00
 0030  00 00 00 00 00 02 00 01  86 a3 00 00 00 02 00 00
 0040  00 04 00 00 00 01 00 00  00 30 00 04 99 cd 00 00
 0050  00 05 6d 65 74 74 65 00  00 00 00 00 03 e8 00 00
 0060  00 64 00 00 00 05 00 00  00 64 00 00 00 0e 00 00
 0070  00 10 00 00 00 11 00 00  00 21 00 00 00 00 00 00
 0080  00 00 01 70 00 72 01 70  00 00 00 00 00 00 00 00
 0090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 00a0  00 00 00 00 00 03 61 65  62 00

An ethernet packet (frame) consists of

An idle time of at least 9.6 microseconds,
An 8-byte preamble (consisting of 62 alternating bits 1 and 0 followed by two bits 1) used for synchronization (not shown here),
The 6-byte destination MAC [Medium Access Control] address (here 00:10:a4:f1:3c:d7),
The 6-byte source MAC address (here 00:e0:4c:39:1b:c2),
A 2-byte frame type (here 08 00, indicating an IP datagram),
The actual data,
A 4-byte CRC (not shown here).

The specification requires the total length (including CRC, excluding preamble) to be at least 64 and at most 1518 bytes long. For the data that means that it must have length in 46..1500. Frames that are too short ("runts"), or too long ("jabbers"), or have a non-integral number of bytes, or have an incorrect checksum, are discarded. IEEE 802.3 uses a 2-byte length field instead of the type field.

IP

After peeling off this outer MAC layer containing the ethernet transport information, 156 data bytes are left, and we know that we have an IP datagram. IP is described in RFC 791.

An IP datagram consists of a header followed by data. The header consists of:

A byte giving a 4-bit version (here 4: IPv4) and a 4-bit header length measured in 32-bit words (here 5). Thus, the header is
```
 45 00 00 9c  00 20 40 00  40 11 b6 cc  c0 a8 01 0c  c0 a8 01 08
 
```
A byte giving a service type (here 0: nothing special),
Two bytes giving the total length of the datagram (here 00 9c, that is, 156),
Two bytes identification, 3 bits flags, 13 bits fragment offset (measured in multiples of 8 bytes), all related to fragmentation, but this packet was not fragmented - indeed, the flag set is the DF [Don't Fragment] bit,
A byte giving the TTL (time to live, here hex 40, that is, 64),
A byte giving the protocol (here hex 11, that is, UDP),
Two bytes giving a header checksum,
The source IP address (here c0 a8 01 0c, that is, 192.168.1.12),
The destination IP address (here c0 a8 01 08, that is, 192.168.1.8)
Optional IP options, and padding to make the header length a multiple of 4.

UDP

After peeling off the IP layer containing the internet transport information, 136 data bytes are left, and we know we have a UDP datagram:

 03 20 08 01 00 88 d3 3d
 e2 c0 87 0b 00 00 00 00 00 00 00 02 00 01 86 a3
 00 00 00 02 00 00 00 04 00 00 00 01 00 00 00 30
 00 04 99 cd 00 00 00 05 6d 65 74 74 65 00 00 00
 00 00 03 e8 00 00 00 64 00 00 00 05 00 00 00 64
 00 00 00 0e 00 00 00 10 00 00 00 11 00 00 00 21
 00 00 00 00 00 00 00 00 01 70 00 72 01 70 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 03 61 65 62 00

UDP (User Datagram Protocol) is described in RFC 768. A UDP datagram consists of a 2-byte source port (here 03 20, that is, 800), a 2-byte destination port (here 08 01, that is, 2049), a 2-byte length (here 00 88, that is 136), a 2-byte checksum, and data. So we have 128 bytes of data. On port 2049 the NFS server is running. It gets these 128 bytes, and sees a Remote Procedure Call.

RPC

RPC is described in RFCs 1050, 1057, 1831. An RPC request consists of

A 4-byte xid, the transmission ID. The client generates a unique xid for each request, and the server reply contains this same xid. (Here xid is e2 c0 87 0b.) The client may, but need not, use the same xid upon retransmission after a timeout. The server may, but need not, discard requests with an xid it already has replied to.
A 4-byte direction (0: call, 1: reply). (Here we have a call.)

In the case of a call, the fields following xid and direction are:

A 4-byte RPC version. (Here we have RPC v2.)
A 4-byte program number and a 4-byte version of the program or service called. (Here we have 00 01 86 a3, that is, 100003, the NFS protocol, and 00 00 00 02, that is NFS v2.)
A 4-byte procedure call number. (Here we have 00 00 00 04, procedure 4 of the NFS specification, that is, NFSPROC_LOOKUP.)
Authentication info (see below).
Verification info (see below).
Procedure-specific parameters.

In the case of a reply, the field following xid and direction is:

A 4-byte status (0: accepted, 1: denied)

In the case of a reply with status "accepted", the fields following are:

Verification info (see below), that may allow the client to verify that this reply is really from the server.
A 4-byte result status (0: call executed successfully, 1: program unavailable, 2: version unavailable, 3: procedure unavailable, 4: bad parameters).
In case of a successfully executed call, the results are here. In case of failure because of unavailable version, two 4-byte fields giving the lowest and highest versions supported. In case of other failures, no further data.

In the case of a reply with status "denied", the fields following are:

A 4-byte rejection reason (0: RPC version not 2, 1: authentication error).
In case of RPC version mismatch, two 4-byte fields giving the lowest and highest versions supported. In case of an authentication error a 4-byte reason (1: bad credentials, 2: client must begin new session, 3: bad verifier, 4: verifier expired or replayed, 5: rejected for security reasons).

The three occurrences of "authentication/verification info" above each are structured as follows: first a 4-byte field giving a type (0: AUTH_NULL, 1: AUTH_UNIX, 2: AUTH_SHORT, 3: AUTH_DES, ...), then the length of the authentication data, then the authentication data.

In the case of the packet we are looking at the authentication info is 00 00 00 01 (AUTH_UNIX) 00 00 00 30 (48 bytes) 00 04 99 cd (stamp) 00 00 00 05 (length of machine name: 5 bytes) 6d 65 74 74 65 00 00 00 ("mette" with three bytes padding) 00 00 03 e8 (user id 1000) 00 00 00 64 (group id 100) 00 00 00 05 (5 auxiliary group ids) 00 00 00 64 00 00 00 0e 00 00 00 10 00 00 00 11 00 00 00 21 (100, 14, 16, 17, 33).

The verification info is 00 00 00 00 00 00 00 00 (AUTH_NULL, length 0).

NFS

After peeling off the RPC parts of the packet we are left with the knowledge: this is for NFS, procedure NFSPROC_LOOKUP, and the parameters for this procedure are

 01 70 00 72 01 70 00 00 00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 00 00 00 03 61 65 62 00

Examining the NFS specification, we see that a lookup has as parameter a struct diropargs, with two fields, fhandle dir and filename name. A fhandle is an opaque structure of size FHSIZE, where FHSIZE=32, created by the server to represent a file (given a lookup from the client). A filename is a string of length at most MAXNAMLEN=255. Here we see length 3, and string "aeb".

Altogether, this packet asked for a lookup of the name "aeb" in a directory with specified fhandle.

The reply

 00 e0 4c 39 1b c2 00 10  a4 f1 3c d7 08 00 45 00
 00 9c 04 ec 00 00 40 11  f2 00 c0 a8 01 08 c0 a8
 01 0c 08 01 03 20 00 88  16 ce e2 c0 87 0b 00 00
 00 01 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 00 00 00 00 00 00 c6 98  01 72 02 70 03 00 00 00
 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 00  00 02 00 00 41 ed 00 00
 00 29 00 00 01 f4 00 00  01 f4 00 00 0c 00 00 00
 10 00 00 00 00 00 00 00  00 06 00 00 00 01 72 01
 98 c6 3d f0 ec fd 00 00  00 00 3d ee 07 54 00 00
 00 00 3d ee 07 54 00 00  00 00

Exercise Interpret this packet. Verify that this is a reply, and that the xid of the reply equals the xid of the request. By some coincidence request and reply have the same length.

The call was successful, and returns

 00 00 00 00 c6 98 01 72  02 70 03 00 00 00 00 00
 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
 00 00 00 00 00 00 00 02  00 00 41 ed 00 00 00 29
 00 00 01 f4 00 00 01 f4  00 00 0c 00 00 00 10 00
 00 00 00 00 00 00 00 06  00 00 00 01 72 01 98 c6
 3d f0 ec fd 00 00 00 00  3d ee 07 54 00 00 00 00
 3d ee 07 54 00 00 00 00

The result of a lookup is a status (here NFS_OK=0), in case NFS_OK followed by the servers fhandle for the file and a fattr giving the file attributes. It is defined by

 struct fattr {
       ftype        type;
       unsigned int mode;
       unsigned int nlink;
       unsigned int uid;
       unsigned int gid;
       unsigned int size;
       unsigned int blocksize;
       unsigned int rdev;
       unsigned int blocks;
       unsigned int fsid;
       unsigned int fileid;
       timeval      atime;
       timeval      mtime;
       timeval      ctime;
 };

Here the file type is 2 (directory), the mode is hex 000041ed, that is octal 40755, a directory with permissions rwxr-xr-x, etc.

Security

NFS is very insecure and can be used only in non-hostile environments.

The AUTH_UNIX authentication uses user id and group id. This is inconvenient: both machines must use the same IDs, and a security problem: someone who has root on his client laptop can give himself any uid and gid desired, and can subsequently access other people's files on the server.

Doing root stuff on the server may be more difficult. Typically user id root on the client is mapped to user id nobody on the server.

7.5 The proc filesystem

The proc filesystem is an example of a virtual filesystem. It does not live on disk. Instead the contents are generated dynamically when its files are read.

Sometimes users are worried that space on their disk is wasted by some huge core file

 % ls -l /proc/kcore
 -r--------    1 root     root     939528192 Sep 30 12:30 /proc/kcore

but this only says something about the amount of memory of the machine.

Many of the files in the proc filesystem have unknown size, since they first get a size when they are generated. Thus, a ls -l will show size 0, while cat will show nonempty contents. Some programs are confused by this.

 % ls -l /proc/version
 -r--r--r--    1 root     root            0 Sep 30 12:31 /proc/version
 % cat /proc/version
 Linux version 2.5.39 (aeb@mette) (gcc version 2.96) #14 Mon Sep 30 03:13:05 CEST 2002

The proc filesystem was designed to hold information about processes, but in the course of time a lot of other cruft was added. Most files are read-only, but some variables can be used to tune kernel behaviour (using echo something > /proc/somefile).

An example

It is easy to add an entry to the proc tree. Let us make /proc/arith/sum that keeps track of the sum of all numbers echoed to this file.

 # insmod sum-module.o
 # ls -l /proc/arith
 total 0
 dr-xr-xr-x    2 root     root            0 Sep 30 12:40 .
 dr-xr-xr-x   89 root     root            0 Sep 30 12:39 ..
 -r--r--r--    1 root     root            0 Sep 30 12:40 sum
 # cat /proc/arith/sum
 0
 # echo 7 > /proc/arith/sum
 # echo 5 > /proc/arith/sum
 # echo 13 > /proc/arith/sum
 # cat /proc/arith/sum
 25
 # rmmod sum-module
 # ls -l /proc/arith
 ls: /proc/arith: No such file or directory
 #

How was this achieved? Here is the code.

 /*
  * sum-module.c
  */
 
 #include <linux/module.h>
 #include <linux/init.h>
 #include <linux/proc_fs.h>
 #include <asm/uaccess.h>
 
 static unsigned long long sum;
 
 static int show_sum(char *buffer, char **start, off_t offset, int length) {
         int size;
 
         size = sprintf(buffer, "%lld\n", sum);
         *start = buffer + offset;
         size -= offset;
         return (size > length) ? length : (size > 0) ? size : 0;
 }
 
 /* Expect decimal number of at most 9 digits followed by '\n' */
 static int add_to_sum(struct file *file, const char *buffer,
                       unsigned long count, void *data) {
         unsigned long val = 0;
         char buf[10];
         char *endp;
 
         if (count > sizeof(buf))
                 return -EINVAL;
         if (copy_from_user(buf, buffer, count))
                 return -EFAULT;
         val = simple_strtoul(buf, &endp, 10);
         if (*endp != '\n')
                 return -EINVAL;
         sum += val;     /* mod 2^64 */
         return count;
 }
 
 static int __init sum_init(void) {
         struct proc_dir_entry *proc_arith;
         struct proc_dir_entry *proc_arith_sum;
 
         proc_arith = proc_mkdir("arith", 0);
         if (!proc_arith) {
                 printk (KERN_ERR "cannot create /proc/arith\n");
                 return -ENOMEM;
         }
         proc_arith_sum = create_proc_info_entry("arith/sum", 0, 0, show_sum);
         if (!proc_arith_sum) {
                 printk (KERN_ERR "cannot create /proc/arith/sum\n");
                 remove_proc_entry("arith", 0);
                 return -ENOMEM;
         }
         proc_arith_sum->write_proc = add_to_sum;
         return 0;
 }
 
 static void __exit sum_exit(void) {
         remove_proc_entry("arith/sum", 0);
         remove_proc_entry("arith", 0);
 }
 
 module_init(sum_init);
 module_exit(sum_exit);
 MODULE_LICENSE("GPL");

seqfiles

In the old days routines generating a proc file would allocate a page and write their output there. A read call would then copy the appropriate part to user space. Typical code was something like

         /* read count bytes from proc file starting at file->f_pos */
         page = get_free_page(GFP_KERNEL);
         length = get_proc_info(page);
         if (file->f_pos < length) {
                 if (count + file->f_pos > length)
                         count = length - file->f_pos;
                 copy_to_user(buf, page + file->f_pos, count);
                 file->f_pos += count;
         }
         free_page(page);

and clearly this works only when the entire content of a proc file fits within a single page. There were other problems as well. For example, when output is generated using sprintf(), it is messy to protect against buffer overflow.

Since 2.5.1 (and 2.4.15) some infrastructure exists for producing generated proc files that are larger than a single page. For an example of use, see the Fibonacci module.

The code

         f = create_proc_entry("foo", mode, NULL);
         if (f)
                 f->proc_fops = &proc_foo_operations;

will create a file /proc/foo with given mode such that opening it yields a file that has proc_foo_operations as struct file_operations. Typically one has something like

 static struct file_operations proc_foo_operations = {
         .open           = foo_open,
         .read           = seq_read,
         .llseek         = seq_lseek,
         .release        = seq_release,
 };

where foo_open is defined as

 static int foo_open(struct inode *inode, struct file *file)
 {
         return seq_open(file, &foo_op);
 }

and foo_op is a struct seq_operations:

 struct seq_operations {
         void * (*start) (struct seq_file *m, loff_t *pos);
         void (*stop) (struct seq_file *m, void *v);
         void * (*next) (struct seq_file *m, void *v, loff_t *pos);
         int (*show) (struct seq_file *m, void *v);
 };
 
 struct seq_operations foo_op = {
         .start    = foo_start,
         .stop     = foo_stop,
         .next     = foo_next,
         .show     = foo_show
 };

To make the proc file work, the foo module has to define the four routines start(), stop(), next(), show(). Each time some amount of data is to be read from the proc file, first start(); show() is done, then a number of times next(); show(), as long as more items fit in the user-supplied buffer, and finally stop().

The start routine can allocate memory, or get locks or down semaphores, and the stop routine can free or unlock or up them again. The values returned by start() and next() are cookies fed to show(), and this latter routine generates the actual output using the specially provided output routines seq_putc(), seq_puts(), seq_printf().

The routines seq_open() etc. are defined in seq_file.c. Here seq_open() initializes a struct seq_file and attaches it to the private_data field of the file structure:

 struct seq_file {
         char *buf;
         size_t size;
         size_t from;
         size_t count;
         loff_t index;
         struct semaphore sem;
         struct seq_operations *op;
         void *private;
 };

(Use a buffer buf of size size. It still contains count unread bytes, starting from buf offset from. We return a sequence of items, and index is the current serial number. The private pointer can be used to point at private data, if desired.)

As an example, look at /proc/mounts, defined in proc/base.c. The open routine here is slightly more complicated, basically something like

 static int mounts_open(struct inode *inode, struct file *file)
 {
         int ret = seq_open(file, &mounts_op);
         if (!ret) {
                 struct seq_file *m = file->private_data;
                 m->private = proc_task(inode)->namespace;
         }
 }

and the start, next, stop, show routines live in namespace.c.

7.6 A baby filesystem example

Let us write a baby filesystem, as an example of how the Virtual File System works. It allows one to mount a block device and then shows the partition table as a file hierarchy, with partitions as files, and links in a chain of logical partitions as directories.

(It helps to know how partition tables work. Very briefly: Start at sector 0, the Master Boot Record. Each partition sector contains 4 descriptors, that either describe a partition, or point at the next partition sector. The partitions described in the MBR are called primary. The others are called logical. The box containing all logical partitions is called the extended partition.)

 # fdisk -lu /dev/hda
    Device Boot    Start       End    Blocks   Id  System
 /dev/hda1   *  16450560  33736499   8642970    5  Extended
 /dev/hda4            63  16450559   8225248+  63  GNU HURD or SysV
 /dev/hda5      16450623  20547134   2048256   83  Linux
 /dev/hda6      20547198  33656174   6554488+  83  Linux
 /dev/hda7      33656238  33736499     40131   82  Linux swap
 # insmod fdiskfs.o; mount -t fdiskfs /dev/hda /mnt
 # ls -R /mnt
 /mnt:
 .  ..  1  2  3  4
 
 /mnt/1:
 .  ..  1  2  3  4
 
 /mnt/1/2:
 .  ..  1  2  3  4
 
 /mnt/1/2/2:
 .  ..  1  2  3  4
 # cat /mnt/4
 sectors 63-16450559, type 63
 # cat /mnt/1/2/2/?
 sectors 63-80324, type 82
 empty slot
 empty slot
 empty slot
 # ls -al /mnt
 total 7
 dr-xr-xr-x    3 root     root            6 1970-01-01 01:00 .
 drwxr-xr-x   44 root     root         4096 2002-12-18 20:18 ..
 dr-xr-xr-x    3 root     root            6 1970-01-01 01:00 1
 -r--r--r--    1 root     root           16 1970-01-01 01:00 2
 -r--r--r--    1 root     root           16 1970-01-01 01:00 3
 -r--r--r--    1 root     root           16 1970-01-01 01:00 4
 # umount /mnt; rmmod fdiskfs

We see a disk with four partition sectors. There are four actual partitions, and three links in the chain of logical partitions, and a lot of empty slots.

Below the code that does this. I wrote it this evening under 2.5.52 - correctness not guaranteed. Don't try it on a disk that is in use - obscure errors may result.

Exercise Change in the code below the regular files into block device nodes giving access to the partition described by the partition descriptor.

 /*
  * fdiskfs.c
  *
  * Compile with
  *
  * gcc -I/path-to-linux/include -nostdinc -iwithprefix include \
  *      -D__KERNEL__ -DMODULE -DKBUILD_MODNAME=fdiskfs \
  *      -O2 -Wall -Wstrict-prototypes -c -o fdiskfs.o fdiskfs.c
  */
 
 #include <linux/module.h>
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/fs.h>
 #include <linux/init.h>
 #include <linux/genhd.h>
 #include <linux/smp_lock.h>
 #include <linux/buffer_head.h>

Three partition types indicate an extended partition, and will be treated as directory. The remaining types become regular files.

 static inline int
 is_extended_type(int type) {
         return (type == DOS_EXTENDED_PARTITION ||
                 type == WIN98_EXTENDED_PARTITION ||
                 type == LINUX_EXTENDED_PARTITION);
 }

The structure of a DOS-type partition table entry.

 typedef struct { unsigned char h,s,c; } chs;
 
 struct fdisk_partition {
         unsigned char bootable;             /* 0 or 0x80 */
         chs begin_chs;
         unsigned char sys_type;
         chs end_chs;
         unsigned int start_sect;
         unsigned int nr_sects;
 };

Read a partition sector from disk.

 static struct buffer_head *
 fdisk_read_sector(struct super_block *s, int sector) {
         struct buffer_head *bh;
         unsigned char *data;
 
         bh = sb_bread(s, sector);
         if (!bh) {
                 printk ("fdiskfs: unable to read sector %d on dev %s\n",
                         sector, s->s_id);
         } else {
                 data = (unsigned char *) bh->b_data;
                 if (data[510] != 0x55 || data[511] != 0xaa) {
                         printk ("No aa55 signature on sector %d of dev %s\n",
                                 sector, s->s_id);
                         brelse(bh);
                         bh = 0;
                 }
         }
         return bh;
 }

Invent some silly scheme of partition numbering. The assumption here is that chains do not fork.

 /* inos: root: 1, primary: 2-5, elsewhere E+(H<<2)+(L<<4)
    E entry (0-3), H chain head (0-3), L chain length (1-max) */
 #define ROOT_INO        1       /* must not be 0 */
 #define PRIMARY_SHIFT   2
 
 #define IS_ROOT(a)      ((a) == ROOT_INO)
 #define IS_PRIMARY(a)   ((a) < 4 + PRIMARY_SHIFT)
 
 #define ROOT_SUB_INO            (PRIMARY_SHIFT)
 #define PRIMARY_SUB_INO(p)      ((((p) - PRIMARY_SHIFT)<<2) + (1<<4))
 #define OTHER_SUB_INO(p)        (((p) & ~3) + (1<<4))
 
 static inline int
 sub_ino(int ino, int pos) {
         return pos + (IS_ROOT(ino) ? ROOT_SUB_INO :
                       IS_PRIMARY(ino) ? PRIMARY_SUB_INO(ino) :
                       OTHER_SUB_INO(ino));
 }
 
 static struct fdisk_partition *
 fdiskfs_find_inode(struct super_block *s, int ino, struct buffer_head **abh) {
         int head, pos, depth;
         unsigned char *data;
         struct fdisk_partition *p;
         int sector, extd, i;
 
         if (IS_ROOT(ino))
                 return NULL;
 
         if (IS_PRIMARY(ino)) {
                 pos = head = ino - PRIMARY_SHIFT;
                 depth = 0;
         } else {
                 pos = (ino & 3);
                 head = ((ino >> 2) & 3);
                 depth = (ino >> 4);
         }
 
         *abh = fdisk_read_sector(s, 0);
         if (!*abh)
                 return NULL;
         data = (*abh)->b_data;
         p = (struct fdisk_partition *)(data + 446 + 16*head);
 
         if (depth == 0)
                 return p;
         extd = sector = p->start_sect;
 
         for (;;) {
                 brelse(*abh);
                 *abh = fdisk_read_sector(s, sector);
                 if (!*abh)
                         return NULL;
                 data = (*abh)->b_data;
                 p = (struct fdisk_partition *)(data + 446);
 
                 if (--depth == 0)
                         return p+pos;
 
                 for (i = 0; i< 4; i++)
                         if (p[i].nr_sects != 0 &&
                             is_extended_type(p[i].sys_type))
                                 break;
                 if (i == 4) {
                         brelse(*abh);
                         *abh = 0;
                         return NULL;
                 }
                 sector = extd + p[i].start_sect;
         }
 }

So far the helper functions. Now VFS code. We need a routine to read a directory, one to lookup a name in a directory, and a routine to read a file.

 static int
 fdiskfs_readdir(struct file *filp, void *dirent, filldir_t filldir) {
         struct inode *dir;
         unsigned long offset, maxoff;
         int i, len, ino, dino, fino;
         int stored = 0;
         char name[3];
 
         lock_kernel();
 
         dir = filp->f_dentry->d_inode;
         dino = dir->i_ino;
         ino = sub_ino(dino, 0);
         offset = filp->f_pos;
         maxoff = 6;
 
         for(;;) {
                 if (offset >= maxoff) {
                         offset = maxoff;
                         filp->f_pos = offset;
                         goto out;
                 }
                 filp->f_pos = offset;
 
                 if (offset == 0) {
                         strcpy(name, ".");
                         fino = dino;
                 } else if (offset == 1) {
                         strcpy(name, "..");
                         fino = parent_ino(filp->f_dentry);
                 } else {
                         i = offset-2;
                         name[0] = '1' + i;
                         name[1] = 0;
                         fino = ino + i;
                 }
                 len = strlen(name);
 
                 if (filldir(dirent, name, len, offset, fino, 0) < 0)
                         goto out;
 
                 stored++;
                 offset++;
         }
 out:
         unlock_kernel();
         return stored;
 }
 
 static struct dentry *
 fdiskfs_lookup(struct inode *dir, struct dentry *dentry) {
         struct inode *inode = NULL;
         const char *name;
         int dino, ino, len, pos;
 
         lock_kernel();
 
         name = dentry->d_name.name;
         len = dentry->d_name.len;
         if (len != 1 || *name < '1' || *name > '4')
                 goto out;
         pos = *name - '1';
 
         dino = dir->i_ino;
         ino = sub_ino(dino, pos);
         inode = iget(dir->i_sb, ino);
  out:
         d_add(dentry, inode);
         unlock_kernel();
         return NULL;
 }
 
 static ssize_t
 fdiskfs_read(struct file *filp, char *buf, size_t count, loff_t *ppos) {
         struct inode *inode;
         struct buffer_head *bh;
         struct fdisk_partition *p;
         int ino, len, offset;
         char file_contents[200];
 
         inode = filp->f_dentry->d_inode;
         ino = inode->i_ino;
 
         p = fdiskfs_find_inode(inode->i_sb, ino, &bh);
         if (!p)
                 goto out;
 
         if (p->nr_sects == 0)
                 sprintf(file_contents, "empty slot\n");
         else
                 sprintf(file_contents, "sectors %d-%d, type %02X%s\n",
                         p->start_sect, p->start_sect + p->nr_sects - 1,
                         p->sys_type, p->bootable ? " boot" : "");
         brelse(bh);
 
         len = strlen(file_contents);
 
         offset = *ppos;
         if (offset >= len) {
                 offset = len;
                 *ppos = offset;
                 return 0;
         }
         len -= offset;
         if (len > count)
                 len = count;
 
         strncpy(buf, file_contents, len);
         *ppos += len;
         return len;
  out:
         return -EIO;
 }
 
 static struct file_operations fdiskfs_dir_operations = {
         .read           = generic_read_dir,
         .readdir        = fdiskfs_readdir,
 };
 
 static struct inode_operations fdiskfs_dir_inode_operations = {
         .lookup         = fdiskfs_lookup,
 };
 
 static struct file_operations fdiskfs_file_operations = {
         .read           = fdiskfs_read,
 };

For the superblock operations we need a method to read an inode.

 static void
 fdiskfs_read_inode(struct inode *i) {
         struct buffer_head *bh = NULL;
         struct fdisk_partition *p;
         int ino, isdir;
 
         ino = i->i_ino;
         if (ino == ROOT_INO) {
                 isdir = 1;
         } else {
                 p = fdiskfs_find_inode(i->i_sb, ino, &bh);
                 if (!p) {
                         printk("fdiskfs: error reading ino %d\n", ino);
                         return;
                 }
                 isdir = is_extended_type(p->sys_type);
                 brelse(bh);
         }
 
         i->i_mtime.tv_sec = i->i_atime.tv_sec = i->i_ctime.tv_sec = 0;
         i->i_mtime.tv_nsec = i->i_atime.tv_nsec = i->i_ctime.tv_nsec = 0;
         i->i_uid = i->i_gid = 0;
 
         if (isdir) {
                 i->i_op = &fdiskfs_dir_inode_operations;
                 i->i_fop = &fdiskfs_dir_operations;
                 i->i_mode = S_IFDIR + 0555;
                 i->i_nlink = 3;         /* ., .., subdirs */
                 i->i_size = 6;
         } else {
                 i->i_fop = &fdiskfs_file_operations;
                 i->i_mode = S_IFREG + 0444;
                 i->i_nlink = 1;
                 i->i_size = 16;
         }
 }
 
 static struct super_operations fdiskfs_ops = {
         .read_inode     = fdiskfs_read_inode,
 };

For the struct file_system_type we need a method that reads the superblock.

 static int
 fdiskfs_fill_super(struct super_block *s, void *data, int silent) {
         struct buffer_head *bh;
 
         sb_set_blocksize(s, 512);
         s->s_maxbytes = 1024;
 
         bh = fdisk_read_sector(s, 0);
         if (!bh)
                 goto out;
         brelse(bh);
 
         s->s_flags |= MS_RDONLY;
         s->s_op = &fdiskfs_ops;
 
         s->s_root = d_alloc_root(iget(s, ROOT_INO));
         if (!s->s_root)
                 goto out;
         return 0;
 
 out:
         return -EINVAL;
 }
 
 static struct super_block *
 fdiskfs_get_sb(struct file_system_type *fs_type, int flags,
                char *dev_name, void *data) {
         return get_sb_bdev(fs_type, flags, dev_name, data, fdiskfs_fill_super);
 }
 
 static struct file_system_type fdiskfs_type = {
         .owner          = THIS_MODULE,
         .name           = "fdiskfs",
         .get_sb         = fdiskfs_get_sb,
         .kill_sb        = kill_block_super,
         .fs_flags       = FS_REQUIRES_DEV,
 };

Finally, the code to register and unregister the filesystem.

 static int __init
 init_fdiskfs(void) {
         return register_filesystem(&fdiskfs_type);
 }
 
 static void __exit
 exit_fdiskfs(void) {
         unregister_filesystem(&fdiskfs_type);
 }
 
 module_init(init_fdiskfs)
 module_exit(exit_fdiskfs)
 MODULE_LICENSE("GPL");

That was all.

8. The Linux Virtual File System

The Linux kernel implements the concept of Virtual File System (VFS, originally Virtual Filesystem Switch), so that it is (to a large degree) possible to separate actual "low-level" filesystem code from the rest of the kernel. The API of a filesystem is described below.

This API was designed with things closely related to the ext2 filesystem in mind. For very different filesystems, like NFS, there are all kinds of problems.

Four main objects: superblock, dentries, inodes, files

The kernel keeps track of files using in-core inodes ("index nodes"), usually derived by the low-level filesystem from on-disk inodes.

A file may have several names, and there is a layer of dentries ("directory entries") that represent pathnames, speeding up the lookup operation.

Several processes may have the same file open for reading or writing, and file structures contain the required information such as the current file position.

Access to a filesystem starts by mounting it. This operation takes a filesystem type (like ext2, vfat, iso9660, nfs) and a device and produces the in-core superblock that contains the information required for operations on the filesystem; a third ingredient, the mount point, specifies what pathname refers to the root of the filesystem.

Auxiliary objects

We have filesystem types, used to connect the name of the filesystem to the routines for setting it up (at mount time) or tearing it down (at umount time).

A struct vfsmount represents a subtree in the big file hierarchy - basically a pair (device, mountpoint).

A struct nameidata represents the result of a lookup.

A struct address_space gives the mapping between the blocks in a file and blocks on disk. It is needed for I/O.

8.1 Terminology

Various objects play a role here. There are file systems, organized collections of files, usually on some disk partition. And there are filesystem types, abstract descriptions of the way data is organized in a filesystem of that type, like FAT16 or ext2. And there is code, perhaps a module, that implements the handling of file systems of a given type. Sometimes this code is called a low-level filesystem, low-level since it sits below the VFS just like low-level SCSI drivers sit below the higher SCSI layers.

8.2 Filesystem type registration

A module implementing a filesystem type must announce its presence so that it can be used. Its task is (i) to have a name, (ii) to know how it is mounted, (iii) to know how to lookup files, (iv) to know how to find (read, write) file contents.

This announcing is done using the call register_filesystem(), either at kernel initialization time or when the module is inserted. There is a single argument, a struct that contains the name of the filesystem type (so that the kernel knows when to invoke it) and a routine that can produce a superblock.

The struct is of type struct file_system_type. Here the 2.2.17 version:

 struct file_system_type {
     const char *name;
     int fs_flags;
     struct super_block *(*read_super) (struct super_block *, void *, int);
     struct file_system_type *next;
 };

The call register_filesystem() hangs this struct in the chain with head file_systems, and unregister_filesystem() removes it again.

Accesses to this chain are protected by the spinlock file_systems_lock. There are no other writers. The main reader is of course the mount() system call (via get_fs_type()). Other readers are get_filesystem_list() used for /proc/filesystems, and the sysfs system call.

The code is in fs/filesystems.c.

 static struct file_system_type tue_fs_type = {
         .owner          = THIS_MODULE,
         .name           = "tue",
         .get_sb         = tue_get_sb,
         .kill_sb        = kill_block_super,
         .fs_flags       = FS_REQUIRES_DEV,
 };
 
 static int __init init_tue_fs(void) {
         return register_filesystem(&tue_fs_type);
 }
 
 static void __exit exit_tue_fs(void)
 {
         unregister_filesystem(&tue_fs_type);
 }

8.3 Struct file_system_type

 struct file_system_type {
         const char *name;
         int fs_flags;
         struct super_block *(*get_sb)(struct file_system_type *,
                 int, char *, void *, struct vfsmount *);
         void (*kill_sb) (struct super_block *);
         struct module *owner;
         struct file_system_type *next;
         struct list_head fs_supers;
         struct lock_class_key s_lock_key;
         struct lock_class_key s_umount_key;
 };

(In 2.4 there was no kill_sb(), and the role of get_sb() was taken by read_super(). The final parameter of get_sb() and the lock_class_key fields are present since 2.6.18.)

Let us look at the fields of the struct file_system_type.

name

Here the filesystem type gives its name ("tue"), so that the kernel can find it when someone does mount -t tue /dev/foo /dir. (The name is the third parameter of the mount system call.) It must be non-NULL. The name string lives in module space. Access must be protected either by a reference to the module, or by the file_systems_lock.

get_sb

At mount time the kernel calls the fstype->get_sb() routine that initializes things and sets up a superblock. It must be non-NULL. Typically this is a 1-line routine that calls one of get_sb_bdev, get_sb_single, get_sb_nodev, get_sb_pseudo.

kill_sb

At umount time the kernel calls the fstype->kill_sb() routine to clean up. It must be non-NULL. Typically one of kill_block_super, kill_anon_super, kill_litter_super.

fs_flags

The fs_flags field of a struct file_system_type is a bitmap, an OR of several possible flags with mostly obscure uses only. The flags are defined in fs.h. This field was introduced in 2.1.43. The number of flags, are their meanings, varies. In 2.6.19 there are the four flags FS_REQUIRES_DEV, FS_BINARY_MOUNTDATA, FS_REVAL_DOT, FS_RENAME_DOES_D_MOVE.

FS_REQUIRES_DEV
The FS_REQUIRES_DEV flag (since 2.1.43) says that this is not a virtual filesystem - an actual underlying block device is required. It is used in only two places: when /proc/filesystems is generated, its absence causes the filesystem type name to be prefixed by "nodev". And in fs/nfsd/export.c this flag is tested in the process of determining whether the filesystem can be exported via NFS. Earlier there were more uses.
FS_BINARY_MOUNTDATA
The FS_BINARY_MOUNTDATA flag (since 2.6.5) is set to tell the selinux code that the mount data is binary, and cannot be handled by the standard option parser. (This flag is set for afs, coda, nfs, smbfs.)
FS_REVAL_DOT
The FS_REVAL_DOT flag (since 2.6.0test4) is set to tell the VFS code (in namei.c) to revalidate the paths "/", ".", ".." since they might have gone stale. (This flag is set for NFS.)
FS_RENAME_DOES_D_MOVE
The FS_RENAME_DOES_D_MOVE flag (since 2.6.19) says that the low-level filesystem will handle d_move() during a rename(). Earlier (2.4.0test6-2.6.19) this was called FS_ODD_RENAME and was used for NFS only, but now this is also useful for ocfs2. See also the discussion of silly rename.
FS_NOMOUNT (gone)
The FS_NOMOUNT flag (2.3.99pre7-2.5.22) says that this filesystem must never be mounted from userland, but is used only kernel-internally. This was used, for example, for pipefs, the implementation of Unix pipes using a kernel-internal filesystem (see fs/pipe.c). Even though the flag has disappeared, the concept remains, and is now represented by the MS_NOUSER flag.
FS_LITTER (gone)
The FS_LITTER flag (2.4.0test3-2.5.7) says that after umount a d_genocide() is needed. This will remove one reference from all dentries in that tree, probably killing all of them, which is necessary in case at creation time the dentries already got reference count 1. (This is typically done for an in-core filesystem where dentries cannot be recreated when needed.) This flag disappeared in Linux 2.5.7 when the explicit kill_super method kill_litter_super was introduced.
FS_SINGLE (gone)
The FS_SINGLE flag (2.3.99pre7-2.5.4) says that there is only a single superblock for this filesystem type, so that only a single instance of this filesystem may exist, possibly mounted in several places.
FS_IBASKET and FS_NO_DCACHE and FS_NO_PRELIM (gone)
The FS_IBASKET was defined in 2.1.43 but never used, and the definition disappeared in 2.3.99pre4. The FS_NO_DCACHE and FS_NO_PRELIM flags were introduced in 2.1.43, but were a mistake and disappeared again in 2.1.44. However, the definitions survived until Linux 2.5.22. For the purposes of these flags, see the comment in 2.1.43:dcache.c.

owner

The owner field of a struct file_system_type points at the module that owns this struct. When doing things that might sleep, we must make sure that the module is not unloaded while we are using its data, and do this with try_inc_mod_count(owner). If this fails then the module was just unloaded. If it succeeds we have incremented a reference count so that the module will not go away before we are done.

This field is NULL for filesystems compiled into the kernel.

Example of the use of owner - sysfs

There exists a strange SYSV system call sysfs that will return (i) a sequence number given a filesystem type, and (ii) a filesystem type given a sequence number, and (iii) the total number of filesystem types registered now. This call is not supported by libc or glibc.

These sequence numbers are rather meaningless since they may change any moment. But this means that one can get a snapshot of the list of filesystem types without looking at /proc/filesystems. For example, the program


  #include <stdio.h>
 #include <linux/unistd.h>
 
 /* define the 3-arg version of sysfs() */
 static _syscall3(int,sysfs,int,option,unsigned int,fsindex,char *,buf);
 
 /* define the 1-arg version of sysfs() */
 static int sysfs1(int i) {
     return sysfs(i,0,NULL);
 }
 
 main(){
     int i, tot;
     char buf[100];      /* how long is a filesystem type name?? */
 
     tot = sysfs1(3);
     if (tot == -1) {
         perror("sysfs(3)");
         exit(1);
     }
 
     for (i=0; i<tot; i++) {
         if (sysfs(2, i, buf)) {
             perror("sysfs(2)");
             exit(1);
         }
         printf("%2d: %s\n", i, buf);
     }
     return 0;
 }

might give output like


   0: ext2
  1: minix
  2: romfs
  3: msdos
  4: vfat
  5: proc
  6: nfs
  7: smbfs
  8: iso9660

The kernel code for copying the names to user space is instructive:

 static int fs_name(unsigned int index, char * buf)
 {
         struct file_system_type * tmp;
         int len, res;
 
         read_lock(&file_systems_lock);
         for (tmp = file_systems; tmp; tmp = tmp->next, index--)
                 if (index <= 0 && try_inc_mod_count(tmp->owner))
                                 break;
         read_unlock(&file_systems_lock);
         if (!tmp)
                 return -EINVAL;
 
         /* OK, we got the reference, so we can safely block */
         len = strlen(tmp->name) + 1;
         res = copy_to_user(buf, tmp->name, len) ? -EFAULT : 0;
         put_filesystem(tmp);
         return res;
 }

In order to walk safely along a linked list we need the read lock. The routines that change links (like register_filesystem) need a write lock. Once the filesystem name with the desired index is found we cannot just copy this name to user space. Maybe the page we want to copy to was swapped out, and getting it back in core takes some time, and maybe the module is unloaded just at that point, and then, when we want to read the name we reference memory that is no longer present. The routine try_inc_mod_count() first gets the module unload lock, then looks whether the module still is present; if so it increases the module's refcount and returns 1 (after releasing the unload lock), otherwise it returns 0. After a successful return of try_inc_mod_count() we own a reference to the module, so that it cannot disappear while we are doing copy_to_user(). The put_filesystem() decreases the module's refcount again.

So this is how the owner field is used: it tells which module must be pinned when we do something with this struct. A module stays as long as its refcount is positive, but can disappear any moment when the refcount becomes zero.

In fs/filesystems.c there is a global variable

 static struct file_system_type *file_systems;

that is the head of the list of known filesystem types. A register_filesystem adds the filesystem to the linked list, an unregister_filesystem removes it again. The field next is the link in this simply linked list. It must be NULL when register_filesystem is called, and is reset to NULL by unregister_filesystem. The list is protected by the file_systems_lock.

fs_supers

The fs_supers field of a struct file_system_type is the head of a list of all superblocks of this type. In each superblock the corresponding link is called s_instances. This list is protected by the spinlock sb_lock. This list is used in sget() for filesystems like NFS where we get a filehandle and must check each superblock of the given type whether it is the right one.

s_lock_key, s_umount_key

These are fields used when CONFIG_LOCKDEP is defined, and take no space otherwise. Used for lock validation.

8.4 Mounting

The mount system call attaches a filesystem to the big file hierarchy at some indicated point. Ingredients needed: (i) a device that carries the filesystem (disk, partition, floppy, CDROM, SmartMedia card, ...), (ii) a directory where the filesystem on that device must be attached, (iii) a filesystem type.

In many cases it is possible to guess (iii) given the bits on the device, but heuristics fail in rare cases. Moreover, sometimes there is no difference on the device, as for example in the case where a FAT filesystem without long filenames must be mounted. Is it msdos? or vfat? That information is only in the user's head. If it must be used later in an environment that cannot handle long filenames it should be mounted as msdos; if files with long names are going to be copied to it, as vfat.

The kernel does not guess (except perhaps at boot time, when the root device has to be found), and requires the three ingredients. In fact the mount system call has five parameters: there are also mount flags (like "read-only") and options, like for ext2 the choice between errors=continue and errors=remount-ro and errors=panic.

The code for sys_mount() is found in fs/namespace.c and fs/super.c. The connection with the filesystem type name is made in do_kern_mount():

         struct file_system_type *type = get_fs_type(fstype);
         struct super_block *sb;
 
         if (!type)
                 return ERR_PTR(-ENODEV);
         sb = type->get_sb(type, flags, name, data);

and this is the only call of the get_sb() routine.

The code for sys_umount() is found in fs/namespace.c and fs/super.c. The counterpart of the just quoted code is the cleanup in deactivate_super():

         fs->kill_sb(s);

and this is the only call of the kill_sb() routine.

8.5 The superblock

The superblock gives global information on a filesystem: the device on which it lives, its block size, its type, the dentry of the root of the filesystem, the methods it has, etc., etc.

 struct super_block {
         dev_t s_dev;
         unsigned long s_blocksize;
         struct file_system_type *s_type;
         struct super_operations *s_op;
         struct dentry *s_root;
         ...
 }

 struct super_operations {
         struct inode *(*alloc_inode)(struct super_block *sb);
         void (*destroy_inode)(struct inode *);
         void (*read_inode) (struct inode *);
         void (*dirty_inode) (struct inode *);
         void (*write_inode) (struct inode *, int);
         void (*put_inode) (struct inode *);
         void (*drop_inode) (struct inode *);
         void (*delete_inode) (struct inode *);
         void (*put_super) (struct super_block *);
         void (*write_super) (struct super_block *);
         int (*sync_fs)(struct super_block *sb, int wait);
         void (*write_super_lockfs) (struct super_block *);
         void (*unlockfs) (struct super_block *);
         int (*statfs) (struct super_block *, struct statfs *);
         int (*remount_fs) (struct super_block *, int *, char *);
         void (*clear_inode) (struct inode *);
         void (*umount_begin) (struct super_block *);
         int (*show_options)(struct seq_file *, struct vfsmount *);
 };

This is enough to get started: the dentry of the root directory tells us the inode of this root directory (and in particular its i_ino), and sb->s_op->read_inode(inode) will read this inode from disk. Now inode->i_op->lookup() allows us to find names in the root directory, etc.

Each superblock is on six lists, with links through the fields s_list, s_dirty, s_io, s_anon, s_files, s_instances, respectively.

The super_blocks list

All superblocks are collected in a list super_blocks with links in the fields s_list. This list is protected by the spinlock sb_lock. The main use is in super.c:get_super() or user_get_super() to find the superblock for a given block device. (Both routines are identical, except that one takes a bdev, the other a dev_t.) This list is also used various places where all superblocks must be sync'ed or all dirty inodes must be written out.

The fs_supers list

All superblocks of a given type are collected in a list headed by the fs_supers field of the struct filesystem_type, with links in the fields s_instances. Also this list is protected by the spinlock sb_lock. See above.

The file list

All open files belonging to a given superblock are chained in a list headed by the s_files field of the superblock, with links in the fields f_list of the files. These lists are protected by the spinlock files_lock. This list is used for example in fs_may_remount_ro() to check that there are no files currently open for writing. See also below.

The list of anonymous dentries

Normally, all dentries are connected to root. However, when NFS filehandles are used this need not be the case. Dentries that are roots of subtrees potentially unconnected to root are chained in a list headed by the s_anon field of the superblock, with links in the fields d_hash. These lists are protected by the spinlock dcache_lock. They are grown in dcache.c:d_alloc_anon() and shrunk in super.c:generic_shutdown_super(). See the discussion in Documentation/filesystems/Exporting.

The inode lists s_dirty, s_io

Lists of inodes to be written out. These lists are headed at the s_dirty (resp. s_io) field of the superblock, with links in the fields i_list. These lists are protected by the spinlock inode_lock. See fs/fs-writeback.c.

8.6 Inodes

An (in-core) inode contains the metadata of a file: its serial number, its protection (mode), its owner, its size, the dates of last access, creation and last modification, etc. It also points to the superblock of the filesystem the file is in, the methods for this file, and the dentries (names) for this file.

 struct inode {
         unsigned long i_ino;
         umode_t i_mode;
         uid_t i_uid;
         gid_t i_gid;
         kdev_t i_rdev;
         loff_t i_size;
         struct timespec i_atime;
         struct timespec i_ctime;
         struct timespec i_mtime;
         struct super_block *i_sb;
         struct inode_operations *i_op;
         struct address_space *i_mapping;
         struct list_head i_dentry;
         ...
 }

In early times, struct inode would end with a union

         union {
                 struct minix_inode_info         minix_i;
                 struct ext2_inode_info          ext2_i;
                 struct ext3_inode_info          ext3_i;
                 struct hpfs_inode_info          hpfs_i;
                 ...
         } u;

to store the filesystemtype specific stuff. One could go from inode to e.g. struct ext3_inode_info via inode->u.ext3_i. This setup was rather dissatisfactory, since it meant that a core data structure had to know about all possible filesystem types (even possible out-of-tree ones) and reserve enough room for the largest one among the struct foofs_inode_info. It also wasted memory.

In Linux 2.5.3 this system was changed, and instead of a big struct inode having a filesystemtype dependent part, we now have big filesystemtype dependent inodes, with a VFS part. Thus, struct ext3_inode_info has as its last field struct inode vfs_inode;, and given the VFS inode inode one finds the ext3 information via EXT3_I(inode), defined as container_of(inode, struct ext3_inode_info, vfs_inode).

The methods of an inode are given in the struct inode_operations.

 struct inode_operations {
         int (*create) (struct inode *, struct dentry *, int);
         struct dentry * (*lookup) (struct inode *, struct dentry *);
         int (*link) (struct dentry *, struct inode *, struct dentry *);
         int (*unlink) (struct inode *, struct dentry *);
         int (*symlink) (struct inode *, struct dentry *, const char *);
         int (*mkdir) (struct inode *, struct dentry *, int);
         int (*rmdir) (struct inode *, struct dentry *);
         int (*mknod) (struct inode *, struct dentry *, int, dev_t);
         int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
         int (*readlink) (struct dentry *, char *,int);
         int (*follow_link) (struct dentry *, struct nameidata *);
         void (*truncate) (struct inode *);
         int (*permission) (struct inode *, int);
         int (*setattr) (struct dentry *, struct iattr *);
         int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
         int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
         ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
         ssize_t (*listxattr) (struct dentry *, char *, size_t);
         int (*removexattr) (struct dentry *, const char *);
 };

Each inode is on four lists, with links through the fields i_hash, i_list, i_dentry, i_devices.

The dentry list

All dentries belonging to this inode (names for this file) are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

The hash list

All inodes live in a hash table, with hash collision chains through the field i_hash of the inode. These lists are protected by the spinlock inode_lock. The appropriate head is found by a hash function; it will be an element of the inode_hashtable[] array when the inode belongs to a superblock, or anon_hash_chain if not.

i_list

Inodes are collected into lists that use the i_list field as link field. The lists are protected by the spinlock inode_lock. An inode is either unused, and then on the chain with head inode_unused, or in use but not dirty, and then on the chain with head inode_in_use, or dirty, and then on one of the per-superblock lists with heads s_dirty or s_io, see above.

i_devices

Inodes belonging to a given block device are collected into a list headed by the bd_inodes field of the block device, with links in the inode i_devices fields. The list is protected by the bdev_lock spinlock. It is used to set the i_bdev field to NULL and to reset i_mapping when the block device goes away.

8.7 Dentries

The dentries encode the filesystem tree structure, the names of the files. Thus, the main parts of a dentry are the inode (if any) that belongs to it, the name (the final part of the pathname), and the parent (the name of the containing directory). There are also the superblocks, the methods, a list of subdirectories, etc.

 struct dentry {
         struct inode *d_inode;
         struct dentry *d_parent;
         struct qstr d_name;
         struct super_block *d_sb;
         struct dentry_operations *d_op;
         struct list_head d_subdirs;
         ...
 }

 struct dentry_operations {
         int (*d_revalidate)(struct dentry *, int);
         int (*d_hash) (struct dentry *, struct qstr *);
         int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
         int (*d_delete)(struct dentry *);
         void (*d_release)(struct dentry *);
         void (*d_iput)(struct dentry *, struct inode *);
 };

Here the strings are given by

 struct qstr {
         const unsigned char *name;
         unsigned int len;
         unsigned int hash;
 };

Each dentry is on five lists, with links through the fields d_hash, d_lru, d_child, d_subdirs, d_alias.

Naming flaw

Some of these names were badly chosen, and lead to confusion. We should do a global replace changing d_subdirs into d_children and d_child into d_sibling.

Value of a dentry

The pathname represented by a dentry, is the concatenation of the name of its parent d_parent, a slash character, and its own name d_name.

However, if the dentry is the root of a mounted filesystem (i.e., if dentry->d_covers != dentry), then its pathname is the pathname of the mount point d_covers. Finally, the pathname of the root of the filesystem (with dentry->d_parent == dentry) is "/", and this is also its d_name.

The d_mounts and d_covers fields of a dentry point back to the dentry itself, except that the d_covers field of the dentry for the root of a mounted filesystem points back to the dentry for the mount point, while the d_mounts field of the dentry for the mount point points at the dentry for the root of a mounted filesystem.

The d_parent field of a dentry points back to the dentry for the directory in which it lives. It points back to the dentry itself in case of the root of a filesystem.

A dentry is called negative if it does not have an associated inode, i.e., if it is a name only.

We see that although a dentry represents a pathname, there may be several dentries for the same pathname, namely when overmounting has taken place. Such dentries have different inodes.

Of course the converse, an inode with several dentries, can also occur.

The above description, with d_mounts and d_covers, was for 2.4. In 2.5 these fields have disappeared, and we only have the integer d_mounted that indicates how many filesystems have been mounted at that point. In case it is nonzero (this is what d_mountpoint() tests), a hash table lookup can find the actual mounted filesystem.

d_hash

Dentries are used to speed up the lookup operation. A hash table dentry_hashtable is used, with an index that is a hash of the name and the parent. The hash collision chain has links through the dentry fields d_hash. This chain is protected by the spinlock dcache_lock.

d_lru

All unused dentries are collected in a list dentry_unused with links in the dentry fields d_lru. This list is protected by the spinlock dcache_lock.

d_child, d_subdirs

All subdirectories of a given directory are collected in a list headed by the dentry field d_subdirs with links in the dentry fields d_child. These lists are protected by the spinlock dcache_lock.

d_alias

All dentries belonging to the same inode are collected in a list headed by the inode field i_dentry with links in the dentry fields d_alias. This list is protected by the spinlock dcache_lock.

8.8 Files

File structures represent open files, that is, an inode together with a current (reading/writing) offset. The offset can be set by the lseek() system call. Note that instead of a pointer to the inode we have a pointer to the dentry - that means that the name used to open a file is known. In particular system calls like getcwd() are possible.

 struct file {
         struct dentry *f_dentry;
         struct vfsmount *f_vfsmnt;
         struct file_operations *f_op;
         mode_t f_mode;
         loff_t f_pos;
         struct fown_struct f_owner;
         unsigned int f_uid, f_gid;
         unsigned long f_version;
         ...
 }

Here the f_owner field gives the owner to use for async I/O signals.

 struct file_operations {
         struct module *owner;
         loff_t (*llseek) (struct file *, loff_t, int);
         ssize_t (*read) (struct file *, char *, size_t, loff_t *);
         ssize_t (*aio_read) (struct kiocb *, char *, size_t, loff_t);
         ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
         ssize_t (*aio_write) (struct kiocb *, const char *, size_t, loff_t);
         int (*readdir) (struct file *, void *, filldir_t);
         unsigned int (*poll) (struct file *, struct poll_table_struct *);
         int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
         int (*mmap) (struct file *, struct vm_area_struct *);
         int (*open) (struct inode *, struct file *);
         int (*flush) (struct file *);
         int (*release) (struct inode *, struct file *);
         int (*fsync) (struct file *, struct dentry *, int datasync);
         int (*aio_fsync) (struct kiocb *, int datasync);
         int (*fasync) (int, struct file *, int);
         int (*lock) (struct file *, int, struct file_lock *);
         ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
         ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);
         ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *);
         ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
         unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 };

Each file is in two lists, with links through the fields f_list, f_ep_links.

f_list

The list with links through f_list was discussed above. It is the list of all files belonging to a given superblock. There is a second use: the tty driver collects all files that are opened instances of a tty in a list headed by tty->tty_files with links through the file field f_list. Conversely, these files point back at the tty via their field private_data.

(This field private_data is also used elsewhere. For example, the proc code uses it to attach a struct seq_file to a file.)

The event poll list

All event poll items belonging to a given file are collected in a list with head f_ep_links, protected by the file field f_ep_lock. (For event poll stuff, see epoll_ctl(2).)

8.9 struct vfsmount

A struct vfsmount describes a mount. The definition lives in mount.h:

 struct vfsmount {
         struct list_head mnt_hash;
         struct vfsmount *mnt_parent;    /* fs we are mounted on */
         struct dentry *mnt_mountpoint;  /* dentry of mountpoint */
         struct dentry *mnt_root;        /* root of the mounted tree */
         struct super_block *mnt_sb;     /* pointer to superblock */
         struct list_head mnt_mounts;    /* list of children, anchored here */
         struct list_head mnt_child;     /* and going through their mnt_child */
         atomic_t mnt_count;
         int mnt_flags;
         char *mnt_devname;              /* Name of device e.g. /dev/dsk/hda1 */
         struct list_head mnt_list;
 };

Long ago (1.3.46) it was introduced as part of the quota code. There was a linked list of struct vfsmounts that contained a device number, device name, mount point name, mount flags, superblock pointer, semaphore, file pointers to quota files and time limits for how long an over-quota situation would be allowed. Nowadays quota have independent bookkeeping, and a struct vfsmount only describes a mount.

These stucts are allocated by alloc_vfsmnt() and released by free_vfsmnt() in namespace.c.

mnt_hash

Vfsmounts live in a hash headed by mount_hashtable[]. The field mnt_hash is the link in the collision chain. This list does not seem to be protected by a lock. They are put into the hash by attach_mnt(), found there by lookup_mnt(), and removed again by detach_mnt(), all from namespace.c.

mnt_parent

Vfsmount for parent.

mnt_mountpoint

Dentry for the mountpoint. The pair (mnt_mountpoint, mnt_parent) (returned by follow_up()) will be the dentry and vfsmount for the parent. Used e.g. in d_path to return the pathname of a dentry.

mnt_root

Dentry for the root of the mounted tree.

mnt_sb

Superblock of the mounted filesystem.

mnt_mounts, mnt_child

The field mnt_mounts of a struct vfsmount is the head of a cyclic list of all submounts (mounts on top of some path relative to the present mount). The remaining links of this cyclic list are stored in the mnt_child fields of its submounting vfsmounts. (And each of these points back at us with its mnt_parent field.) Used in autofs4/expire.c and namespace.c (and nowhere else).

mnt_count

Keep track of users of this structure. Incremented by mntget, decremented by mntput. Initially 1. It will be 2 for a mount that may be unmounted. (Autofs also uses this to test whether a tree is busy.)

mnt_flags

The mount flags, like MNT_NODEV, MNT_NOEXEC, MNT_NOSUID. Earlier also MS_RDONLY (that now is stored in sb->s_flags) and MNT_VISIBLE (came in 2.4.0-test5, went in 2.4.5) that told whether this entry should be visible in /proc/mounts.

mnt_devname

Name used in /proc/mounts.

mnt_list

There was a global cyclic list vfsmntlist containing all mounts, used only to create the contents of /proc/mounts. These days we have per-process namespaces, and the global vfsmntlist has been replaced by current->namespace->list. This list is ordered by the order in which the mounts were done, so that one can do the umounts in reverse order. The field mnt_list contains the pointers for this cyclic list.

8.10 fs_struct

A struct fs_struct determines the interpretation of pathnames referred to by a process (and also, somewhat illogically, contains the umask). The typical reference is current->fs. The definition lives in fs_struct.h:


  struct fs_struct {
         atomic_t count;
         rwlock_t lock;
         int umask;
         struct dentry * root, * pwd, * altroot;
         struct vfsmount * rootmnt, * pwdmnt, * altrootmnt;
 };

Semantics of root and pwd are clear. Remains to discuss altroot.

altroot

In order to support emulation of different operating systems like BSD and SunOS and Solaris, a small wart has been added to the walk_init_root code that finds the root directory for a name lookup.

The altroot field of an fs_struct is usually NULL. It is a function of the personality and the current root, and the sys_personality and sys_chroot system calls call set_fs_altroot().

The effect is determined at kernel compile time. One can define __emul_prefix() in <asm/namei.h> as some pathname, say "usr/gnemul/myOS/". The default is NULL, but some architectures have a definition depending on current->personality. If this prefix is non-NULL, and the corresponding file is found, then set_fs_altroot() will set the altroot and altrootmnt fields of current->fs to dentry and vfsmnt of that file.

A subsequent lookup of a pathname starting with '/' will now first try to use the altroot. If that fails the usual root is used.

8.11 nameidata

A struct nameidata represents the result of a lookup. The definition lives in fs.h:


  struct nameidata {
         struct dentry *dentry;
         struct vfsmount *mnt;
         struct qstr last;
         unsigned int flags;
         int last_type;
 };

The typical use is:


          struct nameidata nd;
         error = user_path_walk(filename, &nd);
         if (!error)
                 path_release(&nd);

where path_release() does


          dput(nd->dentry);
         mntput(nd->mnt);

The core of the routines user_path_walk_link and user_path_walk (which call __user_walk without or with the LOOKUP_FOLLOW flag) is the fragment


          if (path_init(name, flags, nd))
                 error = path_walk(name, nd);

So the basic routines handling nameidata are path_init and path_walk. The former finds the start of the walk, the latter does the walking. (However, the former returns 0 in case it did the walking itself already.)

path_init

The routine path_init initialises the four fields dentry, mnt, flags, last_type. The flags field was given as an argument, and dentry and mnt are initialised to those of the current directory or those of the root directory depending on whether name starts with a '/' or not. It will always return 1 except in a certain obscure case discussed below, where the return 0 means that the complete lookup was done already. (And this case cannot occur for sys_chroot, that is why the code there needs not check the return value.)

A wart

(path_init will always return 1, except when name starts with a '/', in which case it returns whatever walk_init_root returns. walk_init_root will always return 1, except when current->fs->altroot is non-NULL and nd->flags does not contain LOOKUP_NOALT (for sys_chroot it does) and __emul_lookup_dentry succeeds, which it does when pathwalk succeeds - in this case no path_walk is required anymore)

9. Memory

9.1 Physical and virtual memory

Traditionally, one has physical memory, that is, memory that is actually present in the machine, and virtual memory, that is, address space. Usually the virtual memory is much larger than the physical memory, and some hardware or software mechanism makes sure that a program can transparently use this much larger virtual space while in fact only the physical memory is available.

Nowadays things are reversed: on a Pentium II one can have 64 GB physical memory, while addresses have 32 bits, so that the virtual memory has a size of 4 GB. We'll have to wait for a 64-bit architecture to get large amounts of virtual memory again. The present situation on a Pentium with more than 4 GB is that using the PAE (Physical Address Extension) it is possible to place the addressable 4 GB anywhere in the available memory, but it is impossible to have access to more than 4 GB at once.

9.2 Kinds of memory

Kernel and user space work with virtual addresses (also called linear addresses) that are mapped to physical addresses by the memory management hardware. This mapping is defined by page tables, set up by the operating system.

DMA devices use bus addresses. On an i386 PC, bus addresses are the same as physical addresses, but other architectures may have special address mapping hardware to convert bus addresses to physical addresses.

Under Linux one has

 #include <asm/io.h>
 
 phys_addr = virt_to_phys(virt_addr);
 virt_addr = phys_to_virt(phys_addr);
  bus_addr = virt_to_bus(virt_addr);
 virt_addr = bus_to_virt(bus_addr);

All this is about accessing ordinary memory. There is also "shared memory" on the PCI or ISA bus. It can be mapped inside a 32-bit address space using ioremap(), and then used via the readb(), writeb() (etc.) functions.

Life is complicated by the fact that there are various caches around, so that different ways to access the same physical address need not give the same result.

See asm/io.h and Documentation/{IO-mapping.txt,DMA-mapping.txt,DMA-API.txt}.

9.3 Kernel memory handling

Pages

The basic unit of memory is the page. Nobody knows how large a page is (that is why the Linux swapspace structure, with a signature at the end of a page, is so unfortunate), this is architecture-dependent, but typically PAGE_SIZE = 4096. (PAGE_SIZE equals 1 << PAGE_SHIFT, and PAGE_SHIFT is 12, 13, 14, 15, 16 on the various architectures). If one is lucky, the getpagesize() system call returns the page size.

Usually, the page size is determined by the hardware: the relation between virtual addresses and physical addresses is given by page tables, and when a virtual address is referenced that does not (yet) correspond to a physical address, a page fault occurs, and the operating system can take appropriate action. Most hardware allows a very limited choice of page sizes. (For example, a Pentium II knows about 4KiB and 4MiB pages.)

Kernel memory allocation

Buddy system

The kernel uses a buddy system with power-of-two sizes. For order 0, 1, 2, ..., 9 it has lists of areas containing 2^order pages. If a small area is needed and only a larger area is available, the larger area is split into two halves (buddies), possibly repeatedly. In this way the waste is at most 50%. Since 2.5.40, the number of free areas of each order can be seen in /proc/buddyinfo. When an area is freed, it is checked whether its buddy is free as well, and if so they are merged. Read the code in mm/page_alloc.c.

get_free_page

The routine __get_free_page() will give us a page. The routine __get_free_pages() will give a number of consecutive pages. (A power of two, from 1 to 512 or so. The above buddy system is used.)

kmalloc

The routine kmalloc() is good for an area of unknown, arbitrary, smallish length, in the range 32-131072 (more precisely: 1/128 of a page up to 32 pages), preferably below 4096. For the sizes, see <linux/kmalloc_sizes.h>. Because of fragmentation, it will be difficult to get large consecutive areas from kmalloc(). These days kmalloc() returns memory from one of a series of slab caches (see below) with names like "size-32", ..., "size-131072".

Priority

Each of the above routines has a flags parameter (a bit mask) indicating what behaviour is allowed. Deadlock is possible when, in order to free memory, some pages must be swapped out, or some I/O must be completed, and the driver needs memory for that. This parameter also indicated where we want the memory (below 16M for ISA DMA, or ordinary memory, or high memory). Finally, there is a bit specifying whether we would like a hot or a cold page (that is, a page likely to be in the CPU cache, or a page not likely to be there). If the page will be used by the CPU, a hot page will be faster. If the page will be used for device DMA the CPU cache would be invalidated anyway, and a cold page does not waste precious cache contents.

Some common flag values (see linux/gfp.h):

 /* Zone modifiers in GFP_ZONEMASK (see linux/mmzone.h - low four bits) */
 #define __GFP_DMA       0x01
 
 /* Action modifiers - doesn't change the zoning */
 #define __GFP_WAIT      0x10    /* Can wait and reschedule? */
 #define __GFP_HIGH      0x20    /* Should access emergency pools? */
 #define __GFP_IO        0x40    /* Can start low memory physical IO? */
 #define __GFP_FS        0x100   /* Can call down to low-level FS? */
 #define __GFP_COLD      0x200   /* Cache-cold page required */
 
 #define GFP_NOIO        (__GFP_WAIT)
 #define GFP_NOFS        (__GFP_WAIT | __GFP_IO )
 #define GFP_ATOMIC      (__GFP_HIGH)
 #define GFP_USER        (__GFP_WAIT | __GFP_IO | __GFP_FS)
 #define GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)

Uses:

GFP_KERNEL is the default flag. Sleeping is allowed.
GFP_ATOMIC is used in interrupt handlers. Never sleeps.
GFP_USER for user mode allocations. Low priority, and may sleep. (Today equal to GFP_KERNEL.)
GFP_NOIO must not call down to drivers (since it is used from drivers).
GFP_NOFS must not call down to filesystems (since it is used from filesystems -- see, e.g., dcache.c:shrink_dcache_memory and inode.c:shrink_icache_memory).

vmalloc

The routine vmalloc() has a similar purpose, but has a better chance of being able to return larger consecutive areas, and is more expensive. It uses page table manipulation to create an area of memory that is consecutive in virtual memory, but not necessarily in physical memory. Device I/O to such an area is a bad idea. It uses the above calls with GFP_KERNEL to get its memory, so cannot be used in interrupt context.

bigphysarea

There is a patch around, the " bigphysarea patch", allowing one to reserve a large area at boot time. Some devices need a lot of room (for video frame grabbing, scanned images, wavetable synthesis, etc.). See, e.g., video/zr36067.c.

The slab cache

The routine kmalloc is general purpose, and may waste up to 50% space. If certain size areas are needed very frequently, it makes sense to have separate pools for them. For example, Linux has separate pools for inodes, dentries, buffer heads. The pool is created using kmem_cache_create(), and allocation is by kmem_cache_alloc().

The number of special purpose caches is increasing quickly. I have here a 2.4.18 system with 63 slab caches, and a 2.5.56 one with 149 slab caches. Probably this number should be reduced a bit again.

Statistics is found in /proc/slabinfo (and here also statistics on the kmalloc caches is given). From the man page:

        % man 5 slabinfo
        ...
        Frequently used objects in the Linux kernel (buffer heads,
        inodes, dentries, etc.)  have their own  cache.  The  file
        /proc/slabinfo gives statistics. For example:
 
               % cat /proc/slabinfo
               slabinfo - version: 1.1
               kmem_cache            60     78    100    2    2    1
               blkdev_requests     5120   5120     96  128  128    1
               mnt_cache             20     40     96    1    1    1
               inode_cache         7005  14792    480 1598 1849    1
               dentry_cache        5469   5880    128  183  196    1
               filp                 726    760     96   19   19    1
               buffer_head        67131  71240     96 1776 1781    1
               vm_area_struct      1204   1652     64   23   28    1
               ...
               size-8192              1     17   8192    1   17    2
               size-4096             41     73   4096   41   73    1
               ...
 
        For  each  slab  cache, the cache name, the number of cur
        rently active  objects,  the  total  number  of  available
        objects,  the  size of each object in bytes, the number of
        pages with at least one active object, the total number of
        allocated  pages,  and  the  number  of pages per slab are
        given.

Question One of the reasons given for a slab cache is that it will give areas of the precise required size, and thus reduce memory waste. In the man page example above, look at how many pages are used for each slab cache, and how much space it would have cost to kmalloc() everything. Does the slab cache do better?

Having a slab cache has some other advantages besides reducing waste. Allocation and freeing is faster than with the buddy system. There is an attempt at doing some cache colouring.

9.4 i386 addressing

Let us look at an explicit machine to see how addressing is done. This is only a very brief description. For all details, see some Intel manual. (I have here 24547209.pdf, entitled IA-32 Intel Architecture Software Developer's Manual Volume 3, with Chapter 3: Protected Mode Memory Management. It can be downloaded from developer.intel.com.)

Addressing goes in two stages: a logical address determined by segment and offset is translated into a linear address, and next the linear address is translated into a physical address.

Real mode assembler programmers may be familiar with the first stage: one has a 16-bit segment register with value S and 16- or 32-bit offset O, that produce a 20- or 32-bit linear address A = 16*S + O, and this is also the physical address. There are six segment registers: CS, SS, DS, ES, FS, GS. The first is used when instructions are fetched, the second for stack access, the third for data access, and override prefixes allow one to specify any of the six.

Segmentation

In protected mode one still has segmentation (although most systems do not use it - the corresponding hardware is just wasted and present only for compatibility reasons), but one also has paging. The paging is optional (it can be enabled/disabled), but this unused segmentation is always present, and much more complicated than in real mode.

Instead of a single 16-bit value, the segmentation register S is now a 13-bit index I followed by the single bit value TI (Table Index) and the 2-bit value RPL (Requested Protection Level). The bit TI selects one of two tables: the GDT (Global Descriptor Table) or the LDT (Local Descriptor Table). Each can have up to 8192 entries, and the 13-bit index I selects an entry in one of these tables.

Such an entry is an 8-byte struct that has a 32-byte field base, and a 20-byte field limit and several small fields. Here base gives the start of the segment, and limit its length minus one. One of the small fields is a 1-bit value G (Granularity). If this is zero then limit is measured in bytes, if it is one then limit is measured in 4 KiB units. Now that we have the base address B of the segment, add the 32-bit offset O to obtain the linear address.

Paging

When paging is enabled (as it is under Linux) we are only halfway, and this useless segmentation stuff is followed by the extremely useful paging. The linear address A is divided into three parts: a 10-bit directory AD, a 10-bit table AT and a 12-bit offset AO. The corresponding physical address is now found in four stages.

(i) The CR3 processor register points at the page directory.

(ii) Index the page directory with AD to find a page table.

(iii) Index the page table with AT to find the page.

(iv) Index the page with AO to find the physical address.

Thus, the page directory is a table with up to 1024 entries. Each of these entries points to a page table that itself has up to 1024 entries. Each of these point to a page. The entries in page directory and page table have 32 bits, but only the first 20 are needed for the address since by convention pages and page tables start at a 4 KiB boundary so that the last 12 address bits are zero. In page directory and page table these last bits are used for bookkeeping purposes. By far the most important bit is the "present" bit. If the user program refers to a page that is not actually present (according to this bit in the page table) a page fault is generated, and the operating system might fetch the page, map it somewhere in memory, and continue the user program.

Apart from the "present" bit there is the "dirty" bit, that indicates whether the page was modified, the "accessed" bit, that indicates that the page was read, and permission bits.

The above was for 4 KiB pages. There are also 4 MiB pages, where one indexing stage is absent and the offset part has 22 bits. (With PAE one also has 2 MiB pages.)

For many pages of detail, see the Intel manuals.

9.5 Reference

Recently Mel Gorman released his thesis, a detailed description and commentary of the virtual memory management of Linux 2.4.

9.6 Overcommit and OOM

Normally, a user-space program reserves (virtual) memory by calling malloc(). If the return value is NULL, the program knows that no more memory is available, and can do something appropriate. Most programs will print an error message and exit, some first need to clean up lockfiles or so, and some smarter programs can do garbage collection, or adapt the computation to the amount of available memory. This is life under Unix, and all is well.

Linux on the other hand is seriously broken. It will by default answer "yes" to most requests for memory, in the hope that programs ask for more than they actually need. If the hope is fulfilled Linux can run more programs in the same memory, or can run a program that requires more virtual memory than is available. And if not then very bad things happen.

What happens is that the OOM killer (OOM = out-of-memory) is invoked, and it will select some process and kill it. One holds long discussions about the choice of the victim. Maybe not a root process, maybe not a process doing raw I/O, maybe not a process that has already spent weeks doing some computation. And thus it can happen that one's emacs is killed when someone else starts more stuff than the kernel can handle. Ach. Very, very primitive.

Of course, the very existence of an OOM killer is a bug.

A typical case: I do umount -a in a situation where 30000 filesystems are mounted. Now umount runs out of memory and the kernel log reports

 Sep 19 00:33:10 mette kernel: Out of Memory: Killed process 8631 (xterm).
 Sep 19 00:33:34 mette kernel: Out of Memory: Killed process 9154 (xterm).
 Sep 19 00:34:05 mette kernel: Out of Memory: Killed process 6840 (xterm).
 Sep 19 00:34:42 mette kernel: Out of Memory: Killed process 9066 (xterm).
 Sep 19 00:35:15 mette kernel: Out of Memory: Killed process 9269 (xterm).
 Sep 19 00:35:43 mette kernel: Out of Memory: Killed process 9351 (xterm).
 Sep 19 00:36:05 mette kernel: Out of Memory: Killed process 6752 (xterm).

Randomly xterm windows are killed, until the xterm window that was X's console is killed. Then X exits and all user processes die, including the umount process that caused all this.

OK. This is very bad. People lose long-running processes, lose weeks of computation, just because the kernel is an optimist.

Demo program 1: allocate memory without using it.

 #include <stdio.h>
 #include <stdlib.h>
 
 int main (void) {
         int n = 0;
 
         while (1) {
                 if (malloc(1<<20) == NULL) {
                         printf("malloc failure after %d MiB\n", n);
                         return 0;
                 }
                 printf ("got %d MiB\n", ++n);
         }
 }

Demo program 2: allocate memory and actually touch it all.

 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
 
 int main (void) {
         int n = 0;
         char *p;
 
         while (1) {
                 if ((p = malloc(1<<20)) == NULL) {
                         printf("malloc failure after %d MiB\n", n);
                         return 0;
                 }
                 memset (p, 0, (1<<20));
                 printf ("got %d MiB\n", ++n);
         }
 }

Demo program 3: first allocate, and use later.

 #include <stdio.h>
 #include <string.h>
 #include <stdlib.h>
 
 #define N       10000
 
 int main (void) {
         int i, n = 0;
         char *pp[N];
 
         for (n = 0; n < N; n++) {
                 pp[n] = malloc(1<<20);
                 if (pp[n] == NULL)
                         break;
         }
         printf("malloc failure after %d MiB\n", n);
 
         for (i = 0; i < n; i++) {
                 memset (pp[i], 0, (1<<20));
                 printf("%d\n", i+1);
         }
 
         return 0;
 }

Typically, the first demo program will get a very large amount of memory before malloc() returns NULL. The second demo program will get a much smaller amount of memory, now that earlier obtained memory is actually used. The third program will get the same large amount as the first program, and then is killed when it wants to use its memory. (On a well-functioning system, like Solaris, the three demo programs obtain the same amount of memory and do not crash but see malloc() return NULL.)

For example:

On an 8 MiB machine without swap running 1.2.11:
demo1: 274 MiB, demo2: 4 MiB, demo3: 270 / oom after 1 MiB: Killed.
Idem, with 32 MiB swap:
demo1: 1528 MiB, demo2: 36 MiB, demo3: 1528 / oom after 23 MiB: Killed.
On a 32 MiB machine without swap running 2.0.34:
demo1: 1919 MiB, demo2: 11 MiB, demo3: 1919 / oom after 4 MiB: Bus error.
Idem, with 62 MiB swap:
demo1: 1919 MiB, demo2: 81 MiB, demo3: 1919 / oom after 74 MiB: The machine hangs. After several seconds: Out of memory for bash. Out of memory for crond. Bus error.
On a 256 MiB machine without swap running 2.6.8.1:
demo1: 2933 MiB, demo2: after 98 MiB: Killed. Also: Out of Memory: Killed process 17384 (java_vm). demo3: 2933 / oom after 135 MiB: Killed.
Idem, with 539 MiB swap:
demo1: 2933 MiB, demo2: after 635 MiB: Killed. demo3: oom after 624 MiB: Killed.

Other kernels have other strategies. Sometimes processes get a segfault when accessing memory that the kernel is unable to provide, sometimes they are killed, sometimes other processes are killed, sometimes the kernel hangs.

Turning off overcommit

Going in the wrong direction

Since 2.1.27 there are a sysctl VM_OVERCOMMIT_MEMORY and proc file /proc/sys/vm/overcommit_memory with values 1: do overcommit, and 0 (default): don't. Unfortunately, this does not allow you to tell the kernel to be more careful, it only allows you to tell the kernel to be less careful. With overcommit_memory set to 1 every malloc() will succeed. When set to 0 the old heuristics are used, the kernel still overcommits.

Going in the right direction

Since 2.5.30 the values are: 0 (default): as before: guess about how much overcommitment is reasonable, 1: never refuse any malloc(), 2: be precise about the overcommit - never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory. Here /proc/sys/vm/overcommit_ratio (by default 50) is another user-settable parameter. It is possible to set overcommit_ratio to values larger than 100. (See also Documentation/vm/overcommit-accounting.)

After

 # echo 2 > /proc/sys/vm/overcommit_memory

all three demo programs were able to obtain 498 MiB on this 2.6.8.1 machine (256 MiB, 539 MiB swap, lots of other active processes), very satisfactory.

However, without swap, no more processes could be started - already more than half of the memory was committed. After

 # echo 80 > /proc/sys/vm/overcommit_ratio

all three demo programs were able to obtain 34 MiB. (Exercise: solve two equations with two unknowns and conclude that main memory was 250 MiB, and the other processes took 166 MiB.)

One can view the currently committed amount of memory in /proc/meminfo, in the field Committed_AS.

10. Processes

Before looking at the Linux implementation, first a general Unix description of threads, processes, process groups and sessions.

A session contains a number of process groups, and a process group contains a number of processes, and a process contains a number of threads.

A session can have a controlling tty. At most one process group in a session can be a foreground process group. An interrupt character typed on a tty ("Teletype", i.e., terminal) causes a signal to be sent to all members of the foreground process group in the session (if any) that has that tty as controlling tty.

All these objects have numbers, and we have thread IDs, process IDs, process group IDs and session IDs.

10.1 Processes

Creation

A new process is traditionally started using the fork() system call:

 pid_t p;
 
 p = fork();
 if (p == (pid_t) -1)
         /* ERROR */
 else if (p == 0)
         /* CHILD */
 else
         /* PARENT */

This creates a child as a duplicate of its parent. Parent and child are identical in almost all respects. In the code they are distinguished by the fact that the parent learns the process ID of its child, while fork() returns 0 in the child. (It can find the process ID of its parent using the getppid() system call.)

Termination

Normal termination is when the process does

 exit(n);

 return n;

from its main() procedure. It returns the single byte n to its parent.

Abnormal termination is usually caused by a signal.

Collecting the exit code. Zombies

The parent does

 pid_t p;
 int status;
 
 p = wait(&status);

and collects two bytes:

A process that has terminated but has not yet been waited for is a zombie. It need only store these two bytes: exit code and reason for termination.

On the other hand, if the parent dies first, init (process 1) inherits the child and becomes its parent.

Signals

Stopping

Some signals cause a process to stop: SIGSTOP (stop!), SIGTSTP (stop from tty: probably ^Z was typed), SIGTTIN (tty input asked by background process), SIGTTOU (tty output sent by background process, and this was disallowed by stty tostop).

Apart from ^Z there also is ^Y. The former stops the process when it is typed, the latter stops it when it is read.

Signals generated by typing the corresponding character on some tty are sent to all processes that are in the foreground process group of the session that has that tty as controlling tty. (Details below.)

If a process is being traced, every signal will stop it.

Continuing

SIGCONT: continue a stopped process.

Terminating

SIGKILL (die! now!), SIGTERM (please, go away), SIGHUP (modem hangup), SIGINT (^C), SIGQUIT (^\), etc. Many signals have as default action to kill the target. (Sometimes with an additional core dump, when such is allowed by rlimit.) The signals SIGCHLD and SIGWINCH are ignored by default. All except SIGKILL and SIGSTOP can be caught or ignored or blocked. For details, see signal(7).

10.2 Process groups

Every process is member of a unique process group, identified by its process group ID. (When the process is created, it becomes a member of the process group of its parent.) By convention, the process group ID of a process group equals the process ID of the first member of the process group, called the process group leader. A process finds the ID of its process group using the system call getpgrp(), or, equivalently, getpgid(0). One finds the process group ID of process p using getpgid(p).

One may use the command ps j to see PPID (parent process ID), PID (process ID), PGID (process group ID) and SID (session ID) of processes. With a shell that does not know about job control, like ash, each of its children will be in the same session and have the same process group as the shell. With a shell that knows about job control, like bash, the processes of one pipeline. like

 % cat paper | ideal | pic | tbl | eqn | ditroff > out

form a single process group.

Creation

A process pid is put into the process group pgid by

 setpgid(pid, pgid);

If pgid == pid or pgid == 0 then this creates a new process group with process group leader pid. Otherwise, this puts pid into the already existing process group pgid. A zero pid refers to the current process. The call setpgrp() is equivalent to setpgid(0,0).

Restrictions on setpgid()

The calling process must be pid itself, or its parent, and the parent can only do this before pid has done exec(), and only when both belong to the same session. It is an error if process pid is a session leader (and this call would change its pgid).

Typical sequence

 p = fork();
 if (p == (pid_t) -1) {
         /* ERROR */
 } else if (p == 0) {    /* CHILD */
         setpgid(0, pgid);
         ...
 } else {                /* PARENT */
         setpgid(p, pgid);
         ...
 }

This ensures that regardless of whether parent or child is scheduled first, the process group setting is as expected by both.

Signalling and waiting

One can signal all members of a process group:

 killpg(pgrp, sig);

One can wait for children in ones own process group:

 waitpid(0, &status, ...);

or in a specified process group:

 waitpid(-pgrp, &status, ...);

Foreground process group

Among the process groups in a session at most one can be the foreground process group of that session. The tty input and tty signals (signals generated by ^C, ^Z, etc.) go to processes in this foreground process group.

A process can determine the foreground process group in its session using tcgetpgrp(fd), where fd refers to its controlling tty. If there is none, this returns a random value larger than 1 that is not a process group ID.

A process can set the foreground process group in its session using tcsetpgrp(fd,pgrp), where fd refers to its controlling tty, and pgrp is a process group in the its session, and this session still is associated to the controlling tty of the calling process.

How does one get fd? By definition, /dev/tty refers to the controlling tty, entirely independent of redirects of standard input and output. (There is also the function ctermid() to get the name of the controlling terminal. On a POSIX standard system it will return /dev/tty.) Opening the name of the controlling tty gives a file descriptor fd.

Background process groups

All process groups in a session that are not foreground process group are background process groups. Since the user at the keyboard is interacting with foreground processes, background processes should stay away from it. When a background process reads from the terminal it gets a SIGTTIN signal. Normally, that will stop it, the job control shell notices and tells the user, who can say fg to continue this background process as a foreground process, and then this process can read from the terminal. But if the background process ignores or blocks the SIGTTIN signal, or if its process group is orphaned (see below), then the read() returns an EIO error, and no signal is sent. (Indeed, the idea is to tell the process that reading from the terminal is not allowed right now. If it wouldn't see the signal, then it will see the error return.)

When a background process writes to the terminal, it may get a SIGTTOU signal. May: namely, when the flag that this must happen is set (it is off by default). One can set the flag by

 % stty tostop

and clear it again by

 % stty -tostop

and inspect it by

 % stty -a

Again, if TOSTOP is set but the background process ignores or blocks the SIGTTOU signal, or if its process group is orphaned (see below), then the write() returns an EIO error, and no signal is sent.

Orphaned process groups

The process group leader is the first member of the process group. It may terminate before the others, and then the process group is without leader.

A process group is called orphaned when the parent of every member is either in the process group or outside the session. In particular, the process group of the session leader is always orphaned.

If termination of a process causes a process group to become orphaned, and some member is stopped, then all are sent first SIGHUP and then SIGCONT.

The idea is that perhaps the parent of the process group leader is a job control shell. (In the same session but a different process group.) As long as this parent is alive, it can handle the stopping and starting of members in the process group. When it dies, there may be nobody to continue stopped processes. Therefore, these stopped processes are sent SIGHUP, so that they die unless they catch or ignore it, and then SIGCONT to continue them.

Note that the process group of the session leader is already orphaned, so no signals are sent when the session leader dies.

Note also that a process group can become orphaned in two ways by termination of a process: either it was a parent and not itself in the process group, or it was the last element of the process group with a parent outside but in the same session. Furthermore, that a process group can become orphaned other than by termination of a process, namely when some member is moved to a different process group.

10.3 Sessions

Every process group is in a unique session. (When the process is created, it becomes a member of the session of its parent.) By convention, the session ID of a session equals the process ID of the first member of the session, called the session leader. A process finds the ID of its session using the system call getsid().

Every session may have a controlling tty, that then also is called the controlling tty of each of its member processes. A file descriptor for the controlling tty is obtained by opening /dev/tty. (And when that fails, there was no controlling tty.) Given a file descriptor for the controlling tty, one may obtain the SID using tcgetsid(fd).

A session is often set up by a login process. The terminal on which one is logged in then becomes the controlling tty of the session. All processes that are descendants of the login process will in general be members of the session.

Creation

A new session is created by

 pid = setsid();

This is allowed only when the current process is not a process group leader. In order to be sure of that we fork first:

 p = fork();
 if (p) exit(0);
 pid = setsid();

The result is that the current process (with process ID pid) becomes session leader of a new session with session ID pid. Moreover, it becomes process group leader of a new process group. Both session and process group contain only the single process pid. Furthermore, this process has no controlling tty.

The restriction that the current process must not be a process group leader is needed: otherwise its PID serves as PGID of some existing process group and cannot be used as the PGID of a new process group.

Getting a controlling tty

How does one get a controlling terminal? Nobody knows, this is a great mystery.

The System V approach is that the first tty opened by the process becomes its controlling tty.

The BSD approach is that one has to explicitly call

 ioctl(fd, TIOCSCTTY, ...);

to get a controlling tty.

Linux tries to be compatible with both, as always, and this results in a very obscure complex of conditions. Roughly:

The TIOCSCTTY ioctl will give us a controlling tty, provided that (i) the current process is a session leader, and (ii) it does not yet have a controlling tty, and (iii) maybe the tty should not already control some other session; if it does it is an error if we aren't root, or we steal the tty if we are all-powerful.

Opening some terminal will give us a controlling tty, provided that (i) the current process is a session leader, and (ii) it does not yet have a controlling tty, and (iii) the tty does not already control some other session, and (iv) the open did not have the O_NOCTTY flag, and (v) the tty is not the foreground VT, and (vi) the tty is not the console, and (vii) maybe the tty should not be master or slave pty.

Getting rid of a controlling tty

If a process wants to continue as a daemon, it must detach itself from its controlling tty. Above we saw that setsid() will remove the controlling tty. Also the ioctl TIOCNOTTY does this. Moreover, in order not to get a controlling tty again as soon as it opens a tty, the process has to fork once more, to assure that it is not a session leader. Typical code fragment:

         if ((fork()) != 0)
                 exit(0);
         setsid();
         if ((fork()) != 0)
                 exit(0);

Disconnect

If the terminal goes away by modem hangup, and the line was not local, then a SIGHUP is sent to the session leader. Any further reads from the gone terminal return EOF. (Or possibly -1 with errno set to EIO.)

If the terminal is the slave side of a pseudotty, and the master side is closed (for the last time), then a SIGHUP is sent to the foreground process group of the slave side.

When the session leader dies, a SIGHUP is sent to all processes in the foreground process group. Moreover, the terminal stops being the controlling terminal of this session (so that it can become the controlling terminal of another session).

Thus, if the terminal goes away and the session leader is a job control shell, then it can handle things for its descendants, e.g. by sending them again a SIGHUP. If on the other hand the session leader is an innocent process that does not catch SIGHUP, it will die, and all foreground processes get a SIGHUP.

10.4 Threads

A process can have several threads. New threads (with the same PID as the parent thread) are started using the clone system call using the CLONE_THREAD flag. Threads are distinguished by a thread ID (TID). An ordinary process has a single thread with TID equal to PID. The system call gettid() returns the TID. The system call tkill() sends a signal to a single thread.

Example: a process with two threads. Both only print PID and TID and exit. (Linux 2.4.19 or later.)

 % cat << EOF > gettid-demo.c
 #include <unistd.h>
 #include <sys/types.h>
 #define CLONE_SIGHAND   0x00000800
 #define CLONE_THREAD    0x00010000
 #include <linux/unistd.h>
 #include <errno.h>
 _syscall0(pid_t,gettid)
 
 int thread(void *p) {
         printf("thread: %d %d\n", gettid(), getpid());
 }
 
 main() {
         unsigned char stack[4096];
         int i;
 
         i = clone(thread, stack+2048, CLONE_THREAD | CLONE_SIGHAND, NULL);
         if (i == -1)
                 perror("clone");
         else
                 printf("clone returns %d\n", i);
         printf("parent: %d %d\n", gettid(), getpid());
 }
 EOF
 % cc -o gettid-demo gettid-demo.c
 % ./gettid-demo
 clone returns 21826
 parent: 21825 21825
 thread: 21826 21825
 %

11. Character devices

An ASR33 Teletype - origin of the abbreviation tty.

We meet several kinds of objects (character devices, tty drivers, line disciplines). Each registers itself at kernel initialization time (or module insertion time), and can afterwards be found when an open() is done.

11.1 Registration

A character device announces its existence by calling register_chrdev(). The call

 register_chrdev(major, name, fops);

stores the given name (a string) and fops (a struct file_operations *) in the entry of the array chrdevs[] indexed by the integer major, the major device number of the device.

(Devices have a number, the device number, a combination of major and minor device number. Traditionally, the major device number gives the kind of device, and the minor device number is some kind of unit number. However, there are no rules - it is best to consider a device number a cookie, without known structure.)

This stored entry is used again when the device is opened: The filesystem recognizes that the file that is being opened is a special device file, and invokes init_special_inode(). This routine does

 void init_special_inode(struct inode *inode, umode_t mode, dev_t rdev)
 {
         inode->i_mode = mode;
         if (S_ISCHR(mode)) {
                 inode->i_fop = &def_chr_fops;
                 inode->i_rdev = to_kdev_t(rdev);
                 inode->i_cdev = cdget(rdev);
         } else
                 ...
 }

Here to_kdev_t() converts the user mode version of the device number to the kernel version of the device number. The cdget() returns the struct char_device for this major number. It finds it using a hash table, and if we did not have it already, a new one is allocated. In all cases, the reference count of this struct is increased by one. The struct looks like

 struct char_device {
         struct list_head        hash;
         atomic_t                count;
         dev_t                   dev;
         atomic_t                openers;
         struct semaphore        sem;
 };

Here hash is a link in the chain of devices with the same hash, count is the number of references - each cdget() increases and each cdput() decreases this by one, and if it becomes zero, the struct is removed from the hash chain and freed. The field dev stores the device number, the only thing we know about this device. The fields openers and sem are unused. Access to the hash table is protected by the cdev_lock spinlock.

Finally the last item from init_special_inode():

 struct file_operations def_chr_fops = {
         .open = chrdev_open,
 };

That is, we cannot do anything with the character device except opening it, and when we do chrdev_open() is called.

On struct char_device

What is the use of this struct? After removing the unused fields openers and sem we see that we just have a struct in a hash chain and a reference count. It has no function at all, and all related code can be deleted (from 2.5.59).

11.2 Opening

The system call routine sys_open() calls filp_open(), and that calls dentry_open(), which does

 f->f_op = fops_get(inode->i_fop);

In other words, the file f_op is a copy of the inode i_fop. (This fops_get() returns its argument, but increments a reference count in case these file operations live in a module.) Finally, dentry_open() calls the inode open routine if there is one:

 if (f->f_op && f->f_op->open)
         f->f_op->open(inode,f);

Thus, it is here that chrdev_open() is called.

 int chrdev_open(struct inode *inode, struct file *filp) {
         int ret = -ENODEV;
 
         filp->f_op = get_chrfops(major(inode->i_rdev), minor(inode->i_rdev));
         if (filp->f_op) {
                 ret = 0;
                 if (filp->f_op->open)
                         ret = ret = filp->f_op->open(inode,filp);
         }
         return ret;
 }

And the routine get_chrfops() retrieves the struct file operations * that was registered:

 struct file_operations *get_chrfops(unsigned int major, unsigned int minor) {
         return fops_get(chrdevs[major].fops);
 }

(The actual routine checks whether the device did register already, and if not does a request_module("char-major-N") first, where N is the major number.)

We see that the inode fops remains unchanged, so that its open still points to chrdev_open(), but the file fops is changed and now points to what the device registered.

11.3 The tty driver

Let us focus on /dev/tty1, the first virtual console. Most code lives in drivers/char, in the files tty_io.c and n_tty.c and vt.c.

Registration

A tty driver announces its existence by calling tty_register_driver(). This call does a register_chrdev() (with tty_fops) and hangs the driver in the chain tty_drivers.

That chain is used by get_tty_driver(), a routine that given a device number finds the tty driver that handles the device with that number.

get_tty_driver

This routine is used in two places: in fs/char_dev.c:get_chrfops() and in tty_io.c:init_dev(), called from tty_open. The latter use was expected, but what is this strange first use?

 #define is_a_tty_dev(ma)        (ma == TTY_MAJOR || ma == TTYAUX_MAJOR)
 #define need_serial(ma,mi) (get_tty_driver(mk_kdev(ma,mi)) == NULL)
 
 static struct file_operations *
 get_chrfops(unsigned int major, unsigned int minor) {
         ...
         ret = fops_get(chrdevs[major].fops);
         if (ret && is_a_tty_dev(major) && need_serial(major,minor)) {
                 fops_put(ret);
                 ret = NULL;
         }
         if (!ret) {
                 char name[20];
                 sprintf(name, "char-major-%d", major);
                 request_module(name);
                 ret = fops_get(chrdevs[major].fops);
         }
         return ret;
 }

The idea here is that majors 4 and 5 (TTY_MAJOR and TTYAUX_MAJOR) may be served by several modules. Indeed, /dev/tty1 has major,minor 4,1 and is a virtual console, while /dev/ttyS1 has major,minor 4,65 and is a serial line. Thus, in drivers/serial/core.c:uart_register_driver() we see a call of tty_register_driver(), and this former routine is called, e.g., to register serial8250_reg, defined as

 struct uart_driver serial8250_reg = {
         .owner                  = THIS_MODULE,
         .driver_name            = "serial",
         .dev_name               = "ttyS%d",
         .major                  = TTY_MAJOR,
         .minor                  = 64,
         .nr                     = UART_NR,
         .cons                   = SERIAL8250_CONSOLE,
 };

while vt.c:vty_init() calls tty_register_driver() to register console_driver with major = TTY_MAJOR and minor_start = 1.

Opening

As we saw above, opening a character device ends up with calling the open routine from the struct file_operations registered by the device. In the case of a tty, the open routine in tty_fops is tty_open.

The routine tty_open is long and messy, with a lot of special purpose code for controlling ttys, for pseudottys, etc. In the ordinary case the essential part is

 tty_open(struct inode *inode, struct file *filp) {
         struct tty_struct *tty;
         kdev_t device = inode->i_rdev;
 
         init_dev(device, &tty);
         file->private_data = tty;
         tty->driver.open(tty,file);
 }

Thus, first of all, we create a tty_struct. Next, a pointer to this tty_struct is stored in the private_data field of the file struct, so that we can find it later, for example in tty_read():

 tty_read(struct file *file, char *buf, size_t count, ...) {
         struct tty_struct *tty = file->private_data;
         (tty->ldisc.read)(tty,file,buf,count);
 }

Finally we call the open routine of the driver. The field tty->driver was set in init_dev():

 init_dev(kdev_t device, struct tty_struct **ret_tty) {
         struct tty_driver *driver = get_tty_driver(device);
         struct tty_struct *tty = alloc_tty_struct();
 
         initialize_tty_struct(tty);
         tty->device = device;
         tty->driver = *driver;
         (tty->ldisc.open)(tty);
         *ret_tty = tty;
 }

Note that the entire struct tty_driver is copied in the assignment, so that individual fields can be changed without damaging the struct that was registered. However, this is never done, so having a copy is a waste of memory.

Line disciplines

The line discipline gives the protocol on the serial line. Each line discipline has a number, and the normal one is called N_TTY (0). Line disciplines are registered by tty_register_ldisc(), by storing a struct tty_ldisc in the array ldiscs[] (where the index is the line discipline number).

The normal discipline is registered by console_init(), as first among the registered disciplines:

 void __init console_init(void) {
         memset(ldiscs, 0, sizeof(ldiscs));
         tty_register_ldisc(N_TTY, &tty_ldisc_N_TTY);
         ...
 }

The call

         initialize_tty_struct(tty);

we saw in init_dev(), does among other things

         tty->ldisc = ldiscs[N_TTY];

Thus, when tty->ldisc.open is called, it is the open field of the struct tty_ldisc_N_TTY. This struct lives in n_tty.c and its open field is n_tty_open.

More opening

After this preparation, finally tty->driver.open(tty,file) is called. Now that we had /dev/tty1 in mind, that is, one of the virtual consoles, let us see what routine this is. In vt.c:vty_init() we see

         console_driver.open = con_open;
         ...
         tty_register_driver(&console_driver);

So, our open routine is con_open(), an amusing open routine. It creates a virtual console if there wasn't one. So, if you have 8 virtual consoles but open /dev/tty23 then you have 9.

If you have lots of unused consoles and want to free the memory they take, use the command deallocvt.

Exercise Which keystroke changes to console 23?

Reading

The system call sys_read() is found in fs/read_write.c. It calls vfs_read(), and this calls file->f_op->read(). In our case, this is the read routine of tty_fops, which unsurprisingly is tty_read. And above we saw that this calls tty->ldisc.read, which is the read field of tty_ldisc_N_TTY, called read_chan. The code is in n_tty.c. It downs the semaphore tty->atomic_read, hangs itself in the wait queue tty->read_wait of waiters for input, goes to sleep if no input is available, copies input to the user buffer, ups the semaphore tty->atomic_read and returns. (Reality is much more complicated. Try to read the code.)

So, hopefully, somebody will fill the input buffer. Who?

Keyboard interrupts arrive at input/keyboard/atkbd.c:atkbd_interrupt(). It handles the keyboard protocol and converts scancode to keycode. Then input_report_key() is called, a define for input_event(), and this routine offers the event to all registered handlers.

Now keyboard.c:kbd_init() registers kbd_handler, and the result is that keyboard keystrokes will be handled by keyboard.c:kbd_event(), which calls kbd_keycode(). Here keyboard raw, mediumraw, xlate and unicode modes are handled, as is the magic sysrequest key. Scancodes have already been converted to keycodes, here we convert back (yecch) for raw mode, leave things for mediumraw mode, or further convert keycodes to characters using the keymap (set by the utility loadkeys). Finally we call

         put_queue(vc, byte);

with the resulting bytes. Here vc is the foreground virtual console.

Now

 void put_queue(struct vc_data *vc, int ch) {
         struct tty_struct *tty = vc->vc_tty;
 
         tty_insert_flip_char(tty, ch, 0);
         con_schedule_flip(tty);
 }

that is, put_queue() retrieves vc->vc_tty that was set by con_open(), and puts its stuff in the flip buffer. Then the work of transporting this to the read_buffer is scheduled. (In tty raw mode that is a plain copy, but in canonical mode we must react to special characters: the erase character erases, the interrupt character sends an interrupt, etc.) And when the transporting has been done, the bytes are ready to be read by a read() call.

Writing

Here things are entirely analogous. The system call sys_write() calls vfs_write(), and this calls file->f_op->write(). In our case, this is the write routine of tty_fops, which is tty_write. It does do_tty_write(tty->ldisc.write, ...) which downs the semaphore tty->atomic_write, possibly splits up the write into smaller chunks, calls its first argument and ups the semaphore again.

The write routine here is the write field of tty_ldisc_N_TTY, called write_chan. The code is in n_tty.c. It hangs itself in the wait queue tty->write_wait of waiters for room for output, tries to write by calling tty->driver.write, and if that fails to write everything goes to sleep.

Now our driver was console_driver with write routine con_write that calls do_con_write. Here very obscure things are done to handle escape sequences (cursor movement, screen colours, scrolling, etc. etc.), but in the normal case we see

         scr_writew((attr << 8) + byte, screenpos);

that actually writes the character and the (foreground / background / intensity) attributes. All very messy code - not a joy to behold.

11.4 Raw devices

Raw devices are character devices that can be bound to block devices. I/O from/to raw devices bypasses the block caches. Whether that is desirable depends on the application. Usually it is undesirable - there are all kinds of issues with raw devices. A main problem is that of coherency - the block device should not also be accessed directly. An annoyance is that I/O buffers must be aligned. Very few standard programs do this. The code for the raw device does set_blocksize(), so that bad things happen if the device was open already and using a different blocksize. Really, if raw is used it must be the only access path to the block device.

private_data

The block device belonging to a raw device is noted down in the private_data field of the file struct.

ioctls

There are two ioctls: RAW_SETBIND and RAW_GETBIND. The former connects a given raw device to a block device specified by major, minor. The latter reports on a connection. The file descriptor needed for the ioctl is that of the control raw device, with minor number zero. Unbinding is done by binding to major,minor = 0,0.

Binding is done by setting the i_mapping field of the raw device inode to the i_mapping field of the block device. After rebinding this will crash certain kernels because the inode for the block device may have gone away.

11.5 The random device

For security purposes Linux has the devices /dev/random and /dev/urandom. The former produces cryptographically strong bits, but may block when no entropy is available. The latter uses bits from the former when available, and a strong random generator otherwise, and does not block.

Exercise Try dd if=/dev/urandom of=/dev/null bs=1024 count=1000 and immediately afterwards dd if=/dev/random of=/dev/null bs=1024 count=1. The former produces (more than) a megabyte of pseudorandom bits in less than a second. Probably this will have exhausted the entropy pool, and the latter will block until some randomness arrives. Move the mouse a little.

Randomness is needed in-kernel, e.g. for TCP sequence numbers - these must be hard to predict by an attacker to prevent spoofing -, and in user space for passwords or secret keys used to protect something - say the key for the .Xauthority file to protect access to the X server. The random character device is a standard part of the kernel, not something one selects with a config option.

The random device is a subdevice of the mem (for memory) device. The character device major 1 has subdevices mem, kmem, null, port, zero, full, random, urandom, kmsg (for minors 1,2,3,4,5,7,8,9,11 - long ago minor 6 was /dev/core, while minor 10 was reserved for /dev/aio but when aio was implemented it was done differently).

Thus, the registration is found in drivers/char/mem.c

Randomness is stored in the entropy_store, which has an associated variable entropy_count counting available random bits. The routine random_read() sees whether we have some bits, and if so returns them, and otherwise sleeps. The routine urandom_read() just extracts some bits.

So the question is how to obtain randomness. Something nobody can predict even when all running software is known. The random device uses four sources, namely the routines add_X_randomness, for X = keyboard, mouse, disk, interrupt. The keyboard, and the mouse, and each IRQ, and each disk have an associated structure

 struct timer_rand_state {
         __u32           last_time;
         __s32           last_delta,last_delta2;
         int             dont_count_entropy:1;
 };

that remembers when we last did something, and the first and second order differences in the sequence of points in time. The routines add_keyboard_randomness() etc. call add_timer_randomness(), and the current time and the value contributed by the routine (keyboard scancode, mouse data, etc.) are mixed into the pool. In order to estimate the amount of entropy added, only the time is used, not the scancode (etc.) data.

12. Sysfs and kobjects

Sysfs is a virtual filesystem that describes the devices known to the system from various viewpoints. By default it is mounted on /sys. The basic building blocks of the hierarchy are kobjects. The entire setup is strange and messy and will cause lots of bugs. Let us start looking at the reference counting part (nothing wrong with that).

12.1 atomic_t

An atomic_t is an integer variable that can be inspected and changed atomically. It is mostly used for reference counting and semaphores. On i386 it is defined as

 typedef struct { volatile int cnt; } atomic_t;

and available methods (with int i and atomic_t *v) are

 atomic_read(v)                  v->cnt
 atomic_set(v,i)                 v->cnt = i
 atomic_add(i,v)                 v->cnt += i
 atomic_sub(i,v)                 v->cnt -= i
 atomic_sub_and_test(i,v)        (v->cnt -= i) == 0
 atomic_inc(v)                   v->cnt++
 atomic_dec(v)                   v->cnt--
 atomic_dec_and_test(v)          --v->cnt == 0
 atomic_inc_and_test(v)          ++v->cnt == 0
 atomic_add_negative(i,v)        (v->cnt += i) < 0

Here the right hand sides define the intended effect, but the code must be written in assembler to guarantee atomicity. The definitions can be found in <asm/atomic.h>. E.g. for i386:

 static __inline__ void atomic_dec(atomic_t *v) {
         __asm__ __volatile__(
                 LOCK "decl %0"
                 :"=m" (v->cnt)
                 :"m" (v->cnt));
 }

where the LOCK becomes a lock prefix 0xf0 on an SMP machine. For the i386 a simple aligned read is atomic:

 #define atomic_read(v)          ((v)->cnt)

The contents used to be a 24-bit counter, because sparc used the low byte as a spinlock, e.g.,

 static inline int atomic24_read(const atomic24_t *v) {
         int ret = v->cnt;
         while(ret & 0xff)
                 ret = v->cnt;
         return ret >> 8;
 }

but since Linux 2.6.3 also sparc has a 32-bit counter. (The reason the Sparc is troublesome is that its 32-bit memory writes are not atomic: it is possible for another CPU to come and read memory and find some old and some new bytes, a partly written result.)

12.2 struct kref

A struct kref is an object that handles reference counting. It is defined as

 struct kref {
         atomic_t refcount;
 };

and available methods are (with struct kref *k and void (*release)(struct kref *k))

 void kref_init(k)               k->refcount = 1;
 void kref_get(k)                k->refcount++;
 void kref_put(k,release)        if (!--k->refcount) release(k);

Thus, today a struct kref is just an atomic_t. Still, it is a useful abstraction: it has far fewer methods, so reasoning about it is simpler.

Earlier the struct kref had a release field, but all uses turned out to be such that the caller knew the appropriate release function, so that it could be a parameter of kref_put(), saving memory.

Releasing just the struct kref would not be very meaningful. The typical use is shown by

 struct scsi_disk {
         struct scsi_driver *driver;
         struct scsi_device *device;
         struct kref     kref;
         ...
 };
 
 #define to_scsi_disk(obj) container_of(obj,struct scsi_disk,kref)
 
 static void scsi_disk_put(struct scsi_disk *sdkp) {
         down(&sd_ref_sem);
         ...
         kref_put(&sdkp->kref, scsi_disk_release);
         up(&sd_ref_sem);
 }
 
 static void scsi_disk_release(struct kref *kref)
 {
         struct scsi_disk *sdkp = to_scsi_disk(kref);
         ...
         kfree(sdkp);
 }

12.3 struct kobject

A struct kobject represents a kernel object, maybe a device or so, such as the things that show up as directory in the sysfs filesystem.

It is defined as

 #define KOBJ_NAME_LEN   20
 
 struct kobject {
         char                    *k_name;
         char                    name[KOBJ_NAME_LEN];
         struct kref             kref;
         struct list_head        entry;
         struct kobject          *parent;
         struct kset             *kset;
         struct kobj_type        *ktype;
         struct dentry           *dentry;
 };

Every struct kobject has a name, which must not be NULL. The name is kobj->k_name, and this pointer points either to the internal array, if the name is short, or to an external string obtained from kmalloc() and to be kfree()d when the kobject dies. However, we are not supposed to know this - the name is returned by kobject_name(kobj) and set by kobject_set_name(kobj, format, ...). (But that latter routine cannot fail in case the name we use has length less than 20.) The name is set independently of other initialization.

A struct kobject may be member of a set, given by the kset field. Otherwise, this field is NULL. The kset field must be set before calling kobject_init().

The entry field is either empty or part of the circularly linked list containing members of the kset.

A struct kobject is reference counted. The routines kobject_get() and kobject_put() do get/put on the kref field. When the reference count drops to zero, a kobject_cleanup() is done.

A struct kobject must be initialized by kobject_init(). This does the kref_init that sets the refcount to 1, initializes the entry field to an empty circular list, and does

         kobj->kset = kset_get(kobj->kset);

which does not change kobj->kset but increments its refcount: one more element of the set. Note that most fields are not touched by kobject_init(). One should memset it to zero and possibly assign kset before calling kobject_init().

One of the main purposes of a struct kobject is to appear in the sysfs tree. It is added in the tree by kobject_add() and deleted again by kobject_del(). The former calls kobject_hotplug("add",kobj), the latter kobject_hotplug("remove", kobj).

The sysfs dentry is given by the field dentry. The parent directory in the tree is represented by the kobject parent. The routine sysfs_create_dir() will hang a new directory directly below the root /sys when no parent is given.

Let us list the kobject methods (with struct kobject *ko)

 int kobject_set_name(ko, char *, ...)
 char *kobject_name(ko)
 void kobject_init(ko)
 struct kobject *kobject_get(ko)
 void kobject_put(ko)
 void kobject_cleanup(ko)
 int kobject_add(ko)
 void kobject_del(ko)
 int kobject_register(ko)
 void kobject_unregister(ko)
 int kobject_rename(ko, char *new_name)
 void kobject_hotplug(const char *action, ko)
 char *kobject_get_path(struct kset *, ko, int)

Most of these were mentioned above. The routine kobject_register() does kobject_init(); kobject_add(), and kobject_unregister() does kobject_del(); kobject_put(). This is the preferred way of handling kobjects that are represented in the sysfs tree.

The rest will be discussed below.

12.4 struct kset

A struct kset represents a set of kernel objects. It is defined as

 struct kset {
         struct subsystem        *subsys;
         struct kobj_type        *ktype;
         struct list_head        list;
         struct kobject          kobj;
         struct kset_hotplug_ops *hotplug_ops;
 };

The field list provides a circularly linked list of the kobjects that are members of the kset. All kobjects on the list have a kset field that points back to us. In order to examine or manipulate the list, one needs to hold the kset->subsys->rwsem semaphore.

The set is also itself a kobject, given by the field kobj.

Methods are (with struct kset *ks)

 void kset_init(ks)
 struct kset *to_kset(ko)
 struct kset *kset_get(ks)
 void kset_put(ks)
 int kset_add(ks)
 int kset_register(ks)
 void kset_unregister(ks)
 struct kobject *kset_find_obj(ks, char *)

The routine to_kset() converts a pointer to the kobject field of a kset into a pointer to the kset itself.

The routine kset_init() initializes the fields list and kobj. The routine kset_get() does

         to_kset(kobject_get(&ks->kobj))

a rather complicated way of doing kobject_get(&ks->kobj) and returning ks again. The routine kset_put() does kobject_put(&ks->kobj).

The routine kset_add() does a kobject_add() of its kobject. If that kobject did not yet have a parent or kset and we have a subsys, set the kobject's parent to the kobject of the kset of the subsys. On the other hand, if that kobject has a kset but no parent, then kobject_add() will set the parent to the kset itself.

Just like for kobjects, we have kset_register() that does kset_init(); kset_add(), and kset_unregister() that just does kobject_unregister().

Finally, there is kset_find_obj() that given a name and a kset returns the kobject in the kset with that name.

12.5 struct kobj_type

Both a kobject and a kset have a field of type struct kobj_type. Such a struct represents a type of objects, and will hold the methods used to operate on them. The definition is

 struct kobj_type {
         void (*release)(struct kobject *);
         struct sysfs_ops        * sysfs_ops;
         struct attribute        ** default_attrs;
 };

Here the release function is what is called by kobject_cleanup(). It uses the ktype of the kset of the kobject, or, there is none, the ktype of the kobject itself - this is what is returned by the method get_ktype().

Method:

 struct kobj_type *get_ktype(ko)

The attributes describe the ordinary files in the sysfs tree. It is a NULL-terminated list of

 struct attribute {
         char                    *name;
         struct module           *owner;
         mode_t                  mode;
 };

The contents of these files is generated by show() and can possibly be modified by the store() function:

 struct sysfs_ops {
         ssize_t (*show)(kobj, struct attribute *attr, char *buf);
         ssize_t (*store)(kobj, struct attribute *attr, const char *, size_t);
 };

12.6 struct subsystem

Defined by

 struct subsystem {
         struct kset             kset;
         struct rw_semaphore     rwsem;
 };

A kset belongs to a subsystem, and the rwsem of the subsystem is used to protect its list.

Subsystems tend to correspond to toplevel directories in the sysfs hierarchy. Their names in the source tend to end in _subsys (produced by the macro decl_subsys()). I see

 % ls /sys
 block  bus  class  devices  firmware  module
 %

and find in the source system_subsys, block_subsys, bus_subsys, class_subsys, devices_subsys, firmware_subsys, class_obj_subsys, acpi_subsys, edd_subsys, vars_subsys, efi_subsys, cdev_subsys, module_subsys, power_subsys, pci_hotplug_slots_subsys.

Here edd lives below firmware, and system below devices.

Some are not visible in sysfs. We'll look at some of these below.

12.7 struct kobj_map

Device number handling is done using the struct kobj_map. See drivers/base/map.c.

 typedef struct kobject *kobj_probe_t(dev_t, int *, void *);
 
 struct kobj_map {
         struct probe {
                 struct probe *next;
                 dev_t dev;
                 unsigned long range;
                 struct module *owner;
                 kobj_probe_t *get;
                 int (*lock)(dev_t, void *);
                 void *data;
         } *probes[255];
         struct rw_semaphore *sem;
 };

with methods

 int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
              struct module *owner, kobj_probe_t *get,
              int (*lock)(dev_t, void *), void *data);
 void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range);
 struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, int *index);
 struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct subsystem *s);

Each struct probe describes a device number interval starting at dev and with length range. With lots of devices one would need some fast data structure to find stuff, but for the time being the current version suffices. It uses the major device number mod 255 as index in the array, where each entry is head in a linked list of such structs. Handling the linked lists is protected by the semaphore *sem, set to the subsystem semaphore at init time.

The call kobj_lookup() finds the given device number dev on the given map domain (there are two: device numbers for block devices and device numbers for character devices). If the lock function is present it will be called, and the present probe skipped if it returns an error. Then the get function is called to get the kobject for the given device number. If the owner field is set, we take a reference on the corresponding module via try_module_get(owner) in order to protect the lock and get calls. The resulting kobject is returned as value, and the offset in the interval of device numbers is returned via index.

It is possible to have several nested intervals of device numbers, and the lists are sorted such that smaller intervals come first in the linked lists, so that the most specific entry overrides other entries. The final entry is the interval [1,~0] covering all device numbers.

The call kobj_map() adds an entry with the given data. The call kobj_unmap() removes it again. The call kobj_map_init() initializes the map. The subsystem argument provides the semaphore. The base_probe() parameter is the get function for the interval covering all device numbers.

There are two calls of kobj_map_init(), namely

         bdev_map = kobj_map_init(bbase_probe, &block_subsys);

and

         cdev_map = kobj_map_init(cbase_probe, &cdev_subsys);

Here the base_probe functions are

 struct kobject *bbase_probe(dev_t dev, int *part, void *data) {
         if (request_module("block-major-%d-%d", MAJOR(dev), MINOR(dev)) > 0)
                 request_module("block-major-%d", MAJOR(dev));
         return NULL;
 }

and

 struct kobject *cbase_probe(dev_t dev, int *part, void *data) {
         if (request_module("char-major-%d-%d", MAJOR(dev), MINOR(dev)) > 0)
                 request_module("char-major-%d", MAJOR(dev));
         return NULL;
 }

The existence of bdev_map is hidden by the helper functions blk_register_region(), blk_unregister_region(), get_gendisk() that do a kobj_map(), kobj_unmap() and kobj_lookup(), respectively.

12.8 Example: floppy

For example, the only part of floppy.c that refers to kobjects is the routine

 struct kobject *floppy_find(dev_t dev, int *part, void *data) {
         int drive = (*part & 3) | ((*part & 0x80) >> 5);
         if (something wrong)
                 return NULL;
         *part = 0;
         return get_disk(disks[drive]);
 }

that is made the probe function via

         blk_register_region(MKDEV(FLOPPY_MAJOR, 0), 256, THIS_MODULE,
                             floppy_find, NULL, NULL);

Here get_disk() is responsible for doing try_module_get() and kobject_get(), and the kobject involved is the one embedded in the struct gendisk.

In floppy_init() we see

         disks[dr] = alloc_disk(1);
         ...
         sprintf(disks[dr]->disk_name, "fd%d", dr);
         ....
         add_disk(disks[dr]);

Here disks[] is an array of struct gendisks, one for each device. The alloc_disk() routine allocates one and initializes some things. In particular, it sets the kset of the embedded kobject to block_subsys.kset. Next,

 void add_disk(struct gendisk *disk) {
         disk->flags |= GENHD_FL_UP;
         blk_register_region(MKDEV(disk->major, disk->first_minor),
                             disk->minors, NULL, exact_match, exact_lock, disk);
         register_disk(disk);
         blk_register_queue(disk);
 }

Here exact_lock() takes a reference on disk, and exact_match returns the kobject embedded in disk. See how the data field is used here to store the disk pointer.

The register_disk() copies the disk name disk->disk_name (set above to e.g. "fd0") to disk->kobj.name, replacing embedded slashes by exclamation marks, so that the name is suitable as a file name. Then kobject_add() puts the thing in the sysfs tree. Where? Below the subsys, that is, as /sys/block/fd0, because that is where alloc_disk() pointed us.

The blk_register_queue adds a subdirectory queue below /sys/block/fd0.

12.9 Hotplug

One of the fields of a kset is a struct kset_hotplug_ops defined by (in abbreviated notation)

 struct kset_hotplug_ops {
         int (*filter)(kset, kobj);
         char *(*name)(kset, kobj);
         int (*hotplug)(kset, kobj, char **envp, int num_envp, char *buf, int bufsz);
 };

The routine kobject_hotplug() is called by kobject_add() when the kobject is added to the hierarchy. It needs a kset in order to call one of the kset_hotplug_ops functions, and to this end walks up the hierarchy along the parent pointers until a kobj is found with a non-NULL kset. Then it calls kset_hotplug(action, kset, orig_kobj).

Now kset_hotplug() first calls the filter function if there is one, and does not do anything if that returns false. It doesn't do anything either when hotplug_path is the empty string.

(You can set hotplug_path by echoing to /proc/sys/kernel/hotplug. The default value is "/sbin/hotplug".)

Now the executable with pathname found in hotplug_path is called with a single parameter, found by calling the name() function, and if that doesnt yield a name, taking the name of the (kobject of the) kset. The process gets an environment with HOME=/, PATH=/sbin:/bin:/usr/sbin:/usr/bin, ACTION=..., SEQNUM=..., DEVPATH=..., where the path given by kobject_get_path() is the path in the sysfs hierarchy to the kobj. Finally, the hotplug() function may add some more environment parameters. Then the executable is launched.

Оставьте свой комментарий !

Ваше имя:
Комментарий:
Оба поля являются обязательными

Автор Комментарий к данной статье

Linux kernel

Andries Brouwer, aeb@cwi.nl

Binary formats

Example of an OMAGIC file

Shared and static libraries

Personality

Change

Ancient history

Some details

Filesystem character set

ASCII

ISO 8859-1

Unicode

UTF-8

Links

Device special files

Other file types

Sparse files

Dentries

Mounts

Per-process namespaces

Layout

Boot sector

FAT16

FAT32

FAT

The first two FAT entries

FAT12

FAT16

FAT32

Directory Entry

VFAT

FSInfo sector

Variations

Layout

The superblock

The group descriptors

The inode table

History

File sizes

Reserved inodes

Directories

Limits and efficiency

Fragments

Attributes

More

Linux Journaling Filesystems

Setup

Theory

The protocol

XDR

RPC

NFS

Packets on the wire

Ethernet

IP

UDP

RPC

NFS

The reply

Security

An example

seqfiles

name

get_sb

kill_sb

fs_flags

owner

Example of the use of owner - sysfs

next

fs_supers

s_lock_key, s_umount_key

The super_blocks list

The fs_supers list

The file list

The list of anonymous dentries

The inode lists s_dirty, s_io

The dentry list

The hash list

i_list

Andries Brouwer, `aeb@cwi.nl`