Serg Iakovlev

or:

and:

LINUX

Language

Kernel

Package

Book

Test

Forum

iakovlev.org

Основные характеристики Ext2

Юникс использует несколько типов файловых систем. Хотя все они имеют простой набор атрибутов для файлов, реализация их все же различна.

Первая версия линукса была основана на файловой системе MINIX . Затем появилась расширенная файловая система (Ext FS). 2-я ее версия (Ext2) появилась в 1994; она оказалась весьма эффективной.

Особенности Ext2:

При создании файловой системы Ext2, админ может выбрать оптимальный размер блока - от 1 до 4 килобайт. Если предполагается большое число мелких файлов,лучше выбрать блок в 1 килобайт, в этом случае фрагментация будет меньше. Для файловой системы с файлами большого размера предпочтителен бОльший размер блока из-за меньшего оверхеда файловой системы и улучшения производительности диска.
При создании Ext2, админ может выбрать максимальное количество нод для партиции , это опять же может уменьшить оверхед и будет эффективнее использовать дисковое пространство.
Диск разбивается на группы блоков. Каждая группа включает блоки данных и ноды,которые хранятся в смежных треках. Благодаря этому файл целиком хранится в единой группе, благодаря чему доступ к нему оптимален.
Файловая система резервирует место для регулярных файлов. Когда такой файл увеличивается в размере, место для этого уже есть,благодаря чему уменьшается фрагментация.
Поддерживаются линки на файлы. Если его путь менее 60 символов,он хранится в ноде и для его трансляции не нужно чтения дополнительного блока данных.

Тщательно продуманная реализация файловых апдэйтов. При создании линка на файл, счетчик линков в дисковой ноде увеличится на единицу, и в указанной директории появится новое имя. Если с нодой что-то произойдет, удаление файла не приведет к катастрофическим последствиям.
Поддержка автоматической проверки файловой системы при загрузке. Проверка выполняется программой e2fsck, которая может быть выполнена после определенного числа монтирований файловой системы или после определенного промежутка времени.
Поддержка т.н. immutable файлов (они не могут быть модифицированы,удалены или переименованы) , а также т.н. append-only файлов ( данные добавляются в конец).
Совместимость с Unix System V Release 4 и BSD в части групповых прав при создании файла. В SVR4, новый созданный файл имеет user group ID процесса , его создавшего, в BSD, новый файл наследует user group ID каталога. Ext2 включает обе опции .

Ext2 - развивающаяся файловая система. Одни фичи доступны в патчах,другие только планируются. Наиболее важные из них:

Блочная фрагментация: Обычно выбирается большой размер блока , поскольку приложения имеют дело с файлами большого размера. При этом маленькие файлы оказываются нерационально разбросаны по диску. Эту проблему можно решить путем хранения таких файлов в различных фрагментах одного и того же блока.
Прозрачное сжатие и криптование файлов: Позволят пользователям прозрачно хранить сжатые/криптованные версии файлов на диске.
Управление удалением: Опция undelete позволит пользователям восстанавливать содержимое удаленных файлов.
Журналирование: Журналирование позволит делать автоматическую проверку при крахе файловой системы.

Журналирование не было включено в Ext2, оно появилось позже в Ext3.

Структура данных Ext2

Первый блок в любой Ext2 всегда зарезервирован для бут-сектора. Остальная часть партиции разбита на группы блоков, каждая из которых имеет вид как на рисунке Figure 18-1. Все группы имеют одинаковый размер и расположены последовательно, для того чтобы ядро могло обращаться к ним просто по индексу.

Figure 18-1. Layouts of an Ext2 partition and of an Ext2 block group

Групповая организация уменьшает файловую фрагментацию, потому что файл сохраняется по возможности в одну группу. Каждый блок в группе состоит из следующих частей:

копия суперблока
копия дескрипторов блоков данной группы
data block bitmap
inode bitmap
таблица нод
собственно сами данные

Если блок не содержит информации,говорят,что он свободен.

Как мы видим из рисунка Figure 18-1, суперблок и дескрипторы дублируются в каждом блоке. Когда e2fsck выполняет проверку, она ссылается на суперблок и на дескрипторы,которые хранятся в нулевом блоке, и потом копирует их в остальные блоки. Если по какой-то причине они окажутся в нулевом блоке испорченными, утилита берет их из других блоков.

Число групп зависит от размера партиции и от размера самого блока. Для этого используется block bitmap, который хранится в специальном блоке. В каждой группе может быть 8xb блоков , где b = размер блока. Число групп равно s/(8xb), где s - размер партиции в блоках.

Например рассмотрим партицию на 32 гига с 4-KB размером блока. Т.о. в 4-байтным блоке находится битмап , который описывает 32-килобайтный блок, размером 128 MB. Нам нужно 256 групп.

Суперблок

Ext2 суперблок хранится в структуре ext2_super_block , поля которой перечислены в таблице Table 18-1.^[*]
Типы данных _ _u8, _ _u16, _ _u32 указывают на 8, 16, 32 бита, типы данных _ _s8, _ _s16, _ _s32 для знаковых чисел 8, 16, 32.
Байт , 2 байта , 4 байта - типы данных _ _le16, _ _le32, _ _be16, _ _be32 ;
порядок байт little-endian ordering для 2-х и 4-байтных данных , остальные типы - порядок байт big-endian ordering (старшие байты хранятся в старших адресах).

^[*] Для совместимости Ext2 и Ext3 , структура ext2_super_block включает некоторые Ext3-специфические поля, которые не показаны в Table 18-1.

Table 18-1. Поля суперблока Ext2
Type
Field
Description
_ _le32
s_inodes_count
Общее число нод
_ _le32
s_blocks_count
Размер файловой системы в блоках
_ _le32
s_r_blocks_count
Число зарезервированных блоков
_ _le32
s_free_blocks_count
Число свободных блоков
_ _le32
s_free_inodes_count
Число свободных нод
_ _le32
s_first_data_block
Номер первого используемого блока (всегда 1)
_ _le32
s_log_block_size
Размер блока
_ _le32
s_log_frag_size
Размер фрагмента
_ _le32
s_blocks_per_group
Number of blocks per group
_ _le32
s_frags_per_group
Number of fragments per group
_ _le32
s_inodes_per_group
Number of inodes per group
_ _le32
s_mtime
Time of last mount operation
_ _le32
s_wtime
Time of last write operation
_ _le16
s_mnt_count
Mount operations counter
_ _le16
s_max_mnt_count
Number of mount operations before check
_ _le16
s_magic
Magic signature
_ _le16
s_state
Status flag
_ _le16
s_errors
Behavior when detecting errors
_ _le16
s_minor_rev_level
Minor revision level
_ _le32
s_lastcheck
Time of last check
_ _le32
s_checkinterval
Time between checks
_ _le32
s_creator_os
OS where filesystem was created
_ _le32
s_rev_level
Revision level of the filesystem
_ _le16
s_def_resuid
Default UID for reserved blocks
_ _le16
s_def_resgid
Default user group ID for reserved blocks
_ _le32
s_first_ino
Number of first nonreserved inode
_ _le16
s_inode_size
Size of on-disk inode structure
_ _le16
s_block_group_nr
Block group number of this superblock
_ _le32
s_feature_compat
Compatible features bitmap
_ _le32
s_feature_incompat
Incompatible features bitmap
_ _le32
s_feature_ro_compat
Read-only compatible features bitmap
_ _u8 [16]
s_uuid
128-bit filesystem identifier
char [16]
s_volume_name
Volume name
char [64]
s_last_mounted
Pathname of last mount point
_ _le32
s_algorithm_usage_bitmap
Used for compression
_ _u8
s_prealloc_blocks
Number of blocks to preallocate
_ _u8
s_prealloc_dir_blocks
Number of blocks to preallocate for directories
_ _u16
s_padding1
Alignment to word
_ _u32 [204]
s_reserved
Nulls to pad out 1,024 bytes

Поле s_inodes_count хранит число нод, поле s_blocks_count хранит число блоков.

Поле s_log_block_size - размер блока : 0 - блок размером 1 байт, 1 - блок размером 2 байта , и т.д. s_log_frag_size равно s_log_block_size, поскольку фрагментация блока еще не реализована.

s_blocks_per_group, s_frags_per_group, s_inodes_per_group хранят число блоков,фрагментов,нод в каждой группе блоков.

Несколько блоков зарезервировано за админом - они позволяют ему работать даже в случае переполнения.

s_mnt_count, s_max_mnt_count, s_lastcheck, s_checkinterval - это настроечные поля для автоматической проверки Ext2 при загрузке. От них зависит запуск e2fsck после определенного числа монтирований системы или определенного времени. Поле s_state хранит 0 - если файловая система в процессе монтирования, 1 - если отмонтирована, 2 - если есть ошибки.

Групповые дескрипторы и битмап

Каждая группа блоков имеет свой дескриптор, который описан структурой ext2_group_desc в Table 18-2.

Table 18-2. The fields of the Ext2 group descriptor
Type
Field
Description
_ _le32
bg_block_bitmap
Block number of block bitmap
_ _le32
bg_inode_bitmap
Block number of inode bitmap
_ _le32
bg_inode_table
Block number of first inode table block
_ _le16
bg_free_blocks_count
Number of free blocks in the group
_ _le16
bg_free_inodes_count
Number of free inodes in the group
_ _le16
bg_used_dirs_count
Number of directories in the group
_ _le16
bg_pad
Alignment to word
_ _le32 [3]
bg_reserved
Nulls to pad out 24 bytes

bg_free_blocks_count, bg_free_inodes_count, bg_used_dirs_count используются при создании нод и блоков данных. Они определяют блоки , в которых размещаются данные. Битмап - это последовательность бит, 0 означает , что нода или блок данных свободен , 1 - занят. Поскольку блок кратен 1,2 или 4 байтам , один битмап может описывать состояние 8,192, 16,384, или 32,768 блоков.

Таблица нод

Таблица нод состоит из последовательности блоков,каждый из которых состоит из определенного числа нод. Номер первого блока в этой таблице хранится в поле bg_inode_table группового дескриптора.

Все ноды имеют одинаковый размер: 128 байт. В одно-байтовом блоке находится 8 нод, а в 4-байтном всего 32. Для получения общего числа блоков в таблице, нужно разделить общее число нод в группе - это поле в суперблоке , которое называется s_inodes_per_group - на число нод в блоке.

Нода представлене структурой ext2_inode : Table 18-3.

Table 18-3. The fields of an Ext2 disk inode
Type
Field
Description
_ _le16
i_mode
File type and access rights
_ _le16
i_uid
Owner identifier
_ _le32
i_size
File length in bytes
_ _le32
i_atime
Time of last file access
_ _le32
i_ctime
Time that inode last changed
_ _le32
i_mtime
Time that file contents last changed
_ _le32
i_dtime
Time of file deletion
_ _le16
i_gid
User group identifier
_ _le16
i_links_count
Hard links counter
_ _le32
i_blocks
Number of data blocks of the file
_ _le32
i_flags
File flags
union
osd1
Specific operating system information
_ _le32 [EXT2_N_BLOCKS]
i_block
Pointers to data blocks
_ _le32
i_generation
File version (used when the file is accessed by a
network filesystem)
_ _le32
i_file_acl
File access control list
_ _le32
i_dir_acl
Directory access control list
_ _le32
i_faddr
Fragment address
union
osd2
Specific operating system information

Многие поля похожи на соответствующие поля ноды в VFS.

Например, поле i_size хранит длину файла в байтах, поле i_blocks хранит число блоков данных(по 512 байт), которые выделены под файл.

Поля i_size и i_blocks не связаны. Файл всегда состоит из целого числа блоков, непустой файл состоит минимум мз одного блока, и i_size может быть меньше чем 512 x i_blocks, но может быть и больше чем 512 , умноженное на i_blocks.

Поле i_block - массив указателей EXT2_N_BLOCKS (обычно 15) на блоки,используемые для данных,выделенных под файл.

32 бита , зарезервированные полем i_size , ограничивают размер файла в 4 GB. Старший бит в поле i_size не используется, поэтому максимальный размер ограничен 2 GB. Ext2 разрешает большие файлы для 64-bit AMD , Opteron или IBM's PowerPC G5. Поле i_dir_acl не используется, оно представляет 32-битное расширение для i_size. Размер файла хранится в ноде как 64-битное число. 64-версия работает как для 32-биной , так и для 64-битной архитектур. В 32-битной версии , доступ к большим файлам невозможен до тех пор , пока не включен флаг O_LARGEFILE.

Одно из основных условий VFS - каждый файл должен иметь уникальный нодовый номер. Для Ext2 , на диске не хранится маппинг между номером ноды и номером блока, поскольку последний производен от номера блоковой группы и имеет позицию внутри таблицы нод. Предположим , каждая группа блоков состоит из 4,096 нод , и нам нужно знать дисковый адрес ноды 13,021. Нода принадлежит 3-й нодовой группе и ее дисковый адрес хранится в 733-й строке нодовой таблицы. Номер ноды используется Ext2 для быстрого получения нодового дескриптора.

Расширенные атрибуты ноды

Формат ноды Ext2 накладывает ограничения для разработчиков файловых систем. Длина ноды должна быть кратна 2 для избежания фрагментации. Почти все поля структуры ноды заняты. Расширение ноды до 256 байт неоправданно , включая проблемы совместимости.

Для преодоления этих ограничений были введены т.н. расширенные атрибуты. Они хранятся в блоке,размещенном за пределами нод. Поле i_file_acl ноды указывает на блок , в котором находятся эти расширенные атрибуты. Различные ноды могут указывать на один такой блок.

Каждый расширенный атрибут имеет имя и значение. Они составляют массив переменной длины , определенный дескриптором ext2_xattr_entry . Рисунок 18-2 показывает слой расширенных атрибутов внутри блока. Каждый атрибут разбит на 2 части: имя атрибута размещается в начале блока, а значение в конце. Названия блоков упорядочены по имени,в то время как значения уже находятся в соответствии со своими именами.

Figure 18-2. Layout of a block containing extended attributes

Есть множество системных вызовов для установки , получения , удаления расширенных атрибутов файла. Системные вызовы setxattr( ) , lsetxattr( ) , fsetxattr( ) устанавливают атрибуиы. Аналогично , системные вызовы getxattr( ) , lgetxattr( ) , fgetxattr( ) возвращают значение расширенного атрибута. Вызовы listxattr( ), llistxattr( ) , flistxattr( ) получают список атрибутов. Вызовы removexattr( ) , lremovexattr( ) , fremovexattr( ) удаляют расширенный атрибут у файла .

Access Control Lists

Access control lists - механизм защиты файлов в юниксовых системах. Каждому файлу присваиваются привилегии группы , к которой принадлежит хозяин файла.

Linux 2.6 поддерживает ACLs с помощью расширенных атрибутов. Собственно , расширенные атрибуты и были введены для поддержки ACL. Функции chacl( ) , setfacl( ) , getfacl( ) , которые позволяют манипуляции с файловым ACL, производны от системных вызовов setxattr( ) and getxattr( ).

К сожалению , расширение POSIX 1003.1 не принято в качестве стандарта . И получается, что ACL поддерхиваются в большом количестве юниксовых систем , но реализация у них различная .

Как различные типы файлов используют дисковые блоки

Различные типы файлов Ext2 (regular files, pipes, etc.) используют блоки по-разному. Некоторые из них вообще не нуждаются в блоках. Типы файлов перечислены в таблице Table 18-4.

Table 18-4. Ext2 file types
File_type
Description
0
Unknown
1
Regular file
2
Directory
3
Character device
4
Block device
5
Named pipe
6
Socket
7
Symbolic link

Регулярный файл

Наиболее распространенный тип файлов. Регулярный файл нуждается в блоках данных , если он включает какую-то информацию. При создании такой файл пуст и блоки данных ему не нужны. Если вы наберете команду >filename, будет создан новый файл либо обнулен существующий.

Каталог

В Ext2 директории реализованы как файл, в котором блоки данных хранят имена файлов с соответствующими нодовыми номерами. Они включают структуру ext2_dir_entry_2. Поля этой структуры показаны в таблице Table 18-5. Структура имеет произвольную длину, поскольку последнее поле name есть длина массива EXT2_NAME_LEN (обычно 255). По соображениям эффективности эта длина кратна 4 , и при необходимости null-символы добавляются в конец файла. Поле name_len хранит размер файла (Figure 18-3).

Table 18-5. The fields of an Ext2 directory entry
Type
Field
Description
_ _le32
inode
Inode number
_ _le16
rec_len
Directory entry length
_ _u8
name_len
Filename length
_ _u8
file_type
File type
char [EXT2_NAME_LEN]
name
Filename

Поле file_type хранит значение , которое определяет тип файла (Table 18-4). Поле rec_len есть указатель на следующий каталог , это смещение добавляется к стартовому адресу текущей директории для получения следующего каталога. Для удаления каталога поле inode устанавливается в 0 и уменьшается значение поля rec_len предыдущего . Обратите внимание на поле rec_len field of Figure 18-3 ; oldfile было удалено , потому что поле rec_len usr установлено в 12+16 (длина usr и oldfile ).

Figure 18-3. An example of the Ext2 directory

Симлинк

Если симлинк включает до 60 символов , он хранится в поле i_block ноды, который включает массив из 15 4-байтовых чисел; блоки данных здесь не требуются. Если симлинк превышает 60 символов, блок данных нужен по-любому.

Файлы устройств , pipe, socket

Блоков данных здесь не требуется. Вся информация хранится в ноде.

Ext2 Memory Data Structures

For the sake of efficiency, most information stored in the disk data structures of an Ext2 partition are copied into RAM when the filesystem is mounted, thus allowing the kernel to avoid many subsequent disk read operations. To get an idea of how often some data structures change, consider some fundamental operations:

When a new file is created, the values of the s_free_inodes_count field in the Ext2 superblock and of the bg_free_inodes_count field in the proper group descriptor must be decreased.
If the kernel appends some data to an existing file so that the number of data blocks allocated for it increases, the values of the s_free_blocks_count field in the Ext2 superblock and of the bg_free_blocks_count field in the group descriptor must be modified.
Even just rewriting a portion of an existing file involves an update of the s_wtime field of the Ext2 superblock.

Because all Ext2 disk data structures are stored in blocks of the Ext2 partition, the kernel uses the page cache to keep them up-to-date (see the section "Writing Dirty Pages to Disk" in Chapter 15).

Table 18-6 specifies, for each type of data related to Ext2 filesystems and files, the data structure used on the disk to represent its data, the data structure used by the kernel in memory, and a rule of thumb used to determine how much caching is used. Data that is updated very frequently is always cached; that is, the data is permanently stored in memory and included in the page cache until the corresponding Ext2 partition is unmounted. The kernel gets this result by keeping the page's usage counter greater than 0 at all times.

Table 18-6. VFS images of Ext2 data structures
Type
Disk data structure
Memory data structure
Caching mode
Superblock
ext2_super_block
ext2_sb_info
Always cached
Group descriptor
ext2_group_desc
ext2_group_desc
Always cached
Block bitmap
Bit array in block
Bit array in buffer
Dynamic
inode bitmap
Bit array in block
Bit array in buffer
Dynamic
inode
ext2_inode
ext2_inode_info
Dynamic
Data block
Array of bytes
VFS buffer
Dynamic
Free inode
ext2_inode
None
Never
Free block
Array of bytes
None
Never

The never-cached data is not kept in any cache because it does not represent meaningful information. Conversely, the always-cached data is always present in RAM, thus it is never necessary to read the data from disk (periodically, however, the data must be written back to disk). In between these extremes lies the dynamic mode. In this mode, the data is kept in a cache as long as the associated object (inode, data block, or bitmap) is in use; when the file is closed or the data block is deleted, the page frame reclaiming algorithm may remove the associated data from the cache.

It is interesting to observe that inode and block bitmaps are not kept permanently in memory; rather, they are read from disk when needed. Actually, many disk reads are avoided thanks to the page cache, which keeps in memory the most recently used disk blocks (see the section "Storing Blocks in the Page Cache" in Chapter 15).^[*]

^[*] In Linux 2.4 and earlier versions, the most recently used inode and block bitmaps were stored in ad-hoc caches of bounded size.

18.3.1. The Ext2 Superblock Object

As stated in the section "Superblock Objects" in Chapter 12, the s_fs_info field of the VFS superblock points to a structure containing filesystem-specific data. In the case of Ext2, this field points to a structure of type ext2_sb_info, which includes the following information:

Most of the disk superblock fields
An s_sbh pointer to the buffer head of the buffer containing the disk superblock
An s_es pointer to the buffer containing the disk superblock
The number of group descriptors, s_desc_ per_block, that can be packed in a block
An s_group_desc pointer to an array of buffer heads of buffers containing the group descriptors (usually, a single entry is sufficient)
Other data related to mount state, mount options, and so on

Figure 18-4 shows the links between the ext2_sb_info data structures and the buffers and buffer heads relative to the Ext2 superblock and to the group descriptors.

When the kernel mounts an Ext2 filesystem, it invokes the ext2_fill_super( ) function to allocate space for the data structures and to fill them with data read from disk (see the section "Mounting a Generic Filesystem" in Chapter 12). This is a simplified description of the function, which emphasizes the memory allocations for buffers and descriptors:

Allocates an ext2_sb_info descriptor and stores its address in the s_fs_info field of the superblock object passed as the parameter.

Figure 18-4. The ext2_sb_info data structure
Invokes _ _bread( ) to allocate a buffer in a buffer page together with the corresponding buffer head, and to read the superblock from disk into the buffer; as discussed in the section "Searching Blocks in the Page Cache" in Chapter 15, no allocation is performed if the block is already stored in a buffer page in the page cache and it is up-to-date. Stores the buffer head address in the s_sbh field of the Ext2 superblock object.
Allocates an array of bytesone byte for each groupand stores its address in the s_debts field of the ext2_sb_info descriptor (see the section "Creating inodes" later in this chapter).
Allocates an array of pointers to buffer heads, one for each group descriptor, and stores the address of the array in the s_group_desc field of the ext2_sb_info descriptor.
Invokes repeatedly _ _bread( ) to allocate buffers and to read from disk the blocks containing the Ext2 group descriptors; stores the addresses of the buffer heads in the s_group_desc array allocated in the previous step.
Allocates an inode and a dentry object for the root directory, and sets up a few fields of the superblock object so that it will be possible to read the root inode from disk.

Clearly, all the data structures allocated by ext2_fill_super( ) are kept in memory after the function returns; they will be released only when the Ext2 filesystem will be unmounted. When the kernel must modify a field in the Ext2 superblock, it simply writes the new value in the proper position of the corresponding buffer and then marks the buffer as dirty.

18.3.2. The Ext2 inode Object

When opening a file, a pathname lookup is performed. For each component of the pathname that is not already in the dentry cache , a new dentry object and a new inode object are created (see the section "Standard Pathname Lookup" in Chapter 12). When the VFS accesses an Ext2 disk inode, it creates a corresponding inode descriptor of type ext2_inode_info. This descriptor includes the following information:

The whole VFS inode object (see Table 12-3 in Chapter 12) stored in the field vfs_inode
Most of the fields found in the disk's inode structure that are not kept in the VFS inode
The i_block_group block group index at which the inode belongs (see the section "Ext2 Disk Data Structures" earlier in this chapter)
The i_next_alloc_block and i_next_alloc_goal fields, which store the logical block number and the physical block number of the disk block that was most recently allocated to the file, respectively
The i_prealloc_block and i_prealloc_count fields, which are used for data block preallocation (see the section "Allocating a Data Block" later in this chapter)
The xattr_sem field, a read/write semaphore that allows extended attributes to be read concurrently with the file data
The i_acl and i_default_acl fields, which point to the ACLs of the file

When dealing with Ext2 files, the alloc_inode superblock method is implemented by means of the ext2_alloc_inode( ) function. It gets first an ext2_inode_info descriptor from the ext2_inode_cachep slab allocator cache, then it returns the address of the inode object embedded in the new ext2_inode_info descriptor.

Создание Ext2 Filesystem

There are generally two stages to creating a filesystem on a disk. The first step is to format it so that the disk driver can read and write blocks on it. Modern hard disks come preformatted from the factory and need not be reformatted; floppy disks may be formatted on Linux using a utility program such as superformat or fdformat. The second step involves creating a filesystem, which means setting up the structures described in detail earlier in this chapter.

Ext2 filesystems are created by the mke2fs utility program; it assumes the following default options, which may be modified by the user with flags on the command line:

Block size: 1,024 bytes (default value for a small filesystem)
Fragment size: block size (block fragmentation is not implemented)
Number of allocated inodes: 1 inode for each 8,192 bytes
Percentage of reserved blocks: 5 percent

The program performs the following actions:

Initializes the superblock and the group descriptors.
Optionally, checks whether the partition contains defective blocks; if so, it creates a list of defective blocks.
For each block group, reserves all the disk blocks needed to store the superblock, the group descriptors, the inode table, and the two bitmaps.
Initializes the inode bitmap and the data map bitmap of each block group to 0.
Initializes the inode table of each block group.
Creates the /root directory.
Creates the lost+found directory, which is used by e2fsck to link the lost and found defective blocks.
Updates the inode bitmap and the data block bitmap of the block group in which the two previous directories have been created.
Groups the defective blocks (if any) in the lost+found directory.

Let's consider how an Ext2 1.44 MB floppy disk is initialized by mke2fs with the default options.

Once mounted, it appears to the VFS as a volume consisting of 1,412 blocks; each one is 1,024 bytes in length. To examine the disk's contents, we can execute the Unix command:

$ dd if=/dev/fd0 bs=1k count=1440 | od -tx1 -Ax > /tmp/dump_hex

to get a file containing the hexadecimal dump of the floppy disk contents in the /tmp directory.^[*]

^[*] Most information on an Ext2 filesystem could also be obtained by using the dumpe2fs and debugfs utility programs.

By looking at that file, we can see that, due to the limited capacity of the disk, a single group descriptor is sufficient. We also notice that the number of reserved blocks is set to 72 (5 percent of 1,440) and, according to the default option, the inode table must include 1 inode for each 8,192 bytes that is, 184 inodes stored in 23 blocks.

Table 18-7 summarizes how the Ext2 filesystem is created on a floppy disk when the default options are selected.

Table 18-7. Ext2 block allocation for a floppy disk
Block
Content
0
Boot block
1
Superblock
2
Block containing a single block group descriptor
3
Data block bitmap
4
inode bitmap
5-27
inode table: inodes up to 10: reserved (inode 2 is the root); inode 11: lost+found; inodes 12-184: free
28
Root directory (includes ., .., and lost+found)
29
lost+found directory (includes . and ..)
30-40
Reserved blocks preallocated for lost+found directory
41-1439
Free blocks

Функции Ext2

Many of the VFS methods described in Chapter 12 have a corresponding Ext2 implementation. Because it would take a whole book to describe all of them, we limit ourselves to briefly reviewing the methods implemented in Ext2. Once the disk and the memory data structures are clearly understood, the reader should be able to follow the code of the Ext2 functions that implement them.

18.5.1. Ext2 Superblock Operations

Many VFS superblock operations have a specific implementation in Ext2, namely alloc_inode, destroy_inode, read_inode, write_inode, delete_inode, put_super, write_super, statfs, remount_fs, and clear_inode. The addresses of the superblock methods are stored in the ext2_sops array of pointers.

18.5.2. Ext2 inode Operations

Some of the VFS inode operations have a specific implementation in Ext2, which depends on the type of the file to which the inode refers.

The inode operations for Ext2 regular files and Ext2 directories are shown in Table 18-8; the purpose of each method is described in the section "Inode Objects" in Chapter 12. The table does not show the methods that are undefined (a NULL pointer) for both regular files and directories; recall that if a method is undefined, the VFS either invokes a generic function or does nothing at all. The addresses of the Ext2 methods for regular files and directories are stored in the ext2_file_inode_operations and ext2_dir_inode_operations tables, respectively.

Table 18-8. Ext2 inode operations for regular files and directories
VFS inode operation
Regular file
Directory
create
NULL
ext2_create( )
lookup
NULL
ext2_lookup( )
link
NULL
ext2_link( )
unlink
NULL
ext2_unlink( )
symlink
NULL
ext2_symlink( )
mkdir
NULL
ext2_mkdir( )
rmdir
NULL
ext2_rmdir( )
mknod
NULL
ext2_mknod( )
rename
NULL
ext2_rename( )
truncate
ext2_TRuncate( )
NULL
permission
ext2_permission( )
ext2_permission( )
setattr
ext2_setattr( )
ext2_setattr( )
setxattr
generic_setxattr( )
generic_setxattr( )
getxattr
generic_getxattr( )
generic_getxattr( )
listxattr
ext2_listxattr( )
ext2_listxattr( )
removexattr
generic_removexattr( )
generic_removexattr( )

The inode operations for Ext2 symbolic links are shown in Table 18-9 (undefined methods have been omitted). Actually, there are two types of symbolic links: the fast symbolic links represent pathnames that can be fully stored inside the inodes, while the regular symbolic links represent longer pathnames. Accordingly, there are two sets of inode operations, which are stored in the ext2_fast_symlink_inode_operations and ext2_symlink_inode_operations tables, respectively.

Table 18-9. Ext2 inode operations for fast and regular symbolic links
VFS inode operation
Fast symbolic link
Regular symbolic link
readlink
generic_readlink( )
generic_readlink( )
follow_link
ext2_follow_link( )
page_follow_link_light( )
put_link
NULL
page_put_link( )
setxattr
generic_setxattr( )
generic_setxattr( )
getxattr
generic_getxattr( )
generic_getxattr( )
listxattr
ext2_listxattr( )
ext2_listxattr( )
removexattr
generic_removexattr( )
generic_removexattr( )

If the inode refers to a character device file, to a block device file, or to a named pipe (see "FIFOs" in Chapter 19), the inode operations do not depend on the filesystem. They are specified in the chrdev_inode_operations, blkdev_inode_operations, and fifo_inode_operations tables, respectively.

18.5.3. Ext2 File Operations

The file operations specific to the Ext2 filesystem are listed in Table 18-10. As you can see, several VFS methods are implemented by generic functions that are common to many filesystems. The addresses of these methods are stored in the ext2_file_operations table.

Table 18-10. Ext2 file operations
VFS file operation
Ext2 method
llseek
generic_file_llseek( )
read
generic_file_read( )
write
generic_file_write( )
aio_read
generic_file_aio_read( )
aio_write
generic_file_aio_write( )
ioctl
ext2_ioctl( )
mmap
generic_file_mmap( )
open
generic_file_open( )
release
ext2_release_file( )
fsync
ext2_sync_file( )
readv
generic_file_readv( )
writev
generic_file_writev( )
sendfile
generic_file_sendfile( )

Notice that the Ext2's read and write methods are implemented by the generic_file_read( ) and generic_file_write( ) functions, respectively. These are described in the sections "Reading from a File" and "Writing to a File" in Chapter 16.

Управление дисковым пространством Ext2

The storage of a file on disk differs from the view the programmer has of the file in two ways: blocks can be scattered around the disk (although the filesystem tries hard to keep blocks sequential to improve access time), and files may appear to a programmer to be bigger than they really are because a program can introduce holes into them (through the lseek( ) system call).

In this section, we explain how the Ext2 filesystem manages the disk space how it allocates and deallocates inodes and data blocks. Two main problems must be addressed:

Space management must make every effort to avoid file fragmentation the physical storage of a file in several, small pieces located in non-adjacent disk blocks. File fragmentation increases the average time of sequential read operations on the files, because the disk heads must be frequently repositioned during the read operation.^[*] This problem is similar to the external fragmentation of RAM discussed in the section "The Buddy System Algorithm" in Chapter 8.
^[*] Please note that fragmenting a file across block groups (A Bad Thing) is quite different from the not-yet-implemented fragmentation of blocks to store many files in one block (A Good Thing).
Space management must be time-efficient; that is, the kernel should be able to quickly derive from a file offset the corresponding logical block number in the Ext2 partition. In doing so, the kernel should limit as much as possible the number of accesses to addressing tables stored on disk, because each such intermediate access considerably increases the average file access time.

18.6.1. Creating inodes

The ext2_new_inode( ) function creates an Ext2 disk inode, returning the address of the corresponding inode object (or NULL, in case of failure). The function carefully selects the block group that contains the new inode; this is done to spread unrelated directories among different groups and, at the same time, to put files into the same group as their parent directories. To balance the number of regular files and directories in a block group, Ext2 introduces a "debt" parameter for every block group.

The function acts on two parameters: the address dir of the inode object that refers to the directory into which the new inode must be inserted and a mode that indicates the type of inode being created. The latter argument also includes the MS_SYNCHRONOUS mount flag (see the section "Mounting a Generic Filesystem" in Chapter 12) that requires the current process to be suspended until the inode is allocated. The function performs the following actions:

Invokes new_inode( ) to allocate a new VFS inode object; initializes its i_sb field to the superblock address stored in dir->i_sb, and adds it to the in-use inode list and to the superblock's list (see the section "Inode Objects" in Chapter 12).
If the new inode is a directory, the function invokes find_group_orlov( ) to find a suitable block group for the directory.^[*] This function implements the following heuristics:
^[*] The Ext2 filesystem may also be mounted with an option flag that forces the kernel to make use of a simpler, older allocation strategy, which is implemented by the find_group_dir( ) function.
1. Directories having as parent the filesystem root should be spread among all block groups. Thus, the function searches the block groups looking for a group having a number of free inodes and a number of free blocks above the average. If there is no such group, it jumps to step 2c.
2. Nested directoriesnot having the filesystem root as parentshould be put in the group of the parent if it satisfies the following rules:
  The group does not contain too many directories
  The group has a sufficient number of free inodes left
  The group has a small "debt" (the debt of a block group is stored in the array of counters pointed to by the s_debts field of the ext2_sb_info descriptor; the debt is increased each time a new directory is added and decreased each time another type of file is added)
  If the parent's group does not satisfy these rules, it picks the first group that satisfies them. If no such group exists, it jumps to step 2c.
3. This is the "fallback" rule, to be used if no good group has been found. The function starts with the block group containing the parent directory and selects the first block group that has more free inodes than the average number of free inodes per block group.
If the new inode is not a directory, it invokes find_group_other( ) to allocate it in a block group having a free inode. This function selects the group by starting from the one that contains the parent directory and moving farther away from it; to be precise:
1. Performs a quick logarithmic search starting from the block group that includes the parent directory dir. The algorithm searches log(n) block groups, where n is the total number of block groups. The algorithm jumps further ahead until it finds an available block group for example, if we call the number of the starting block group i, the algorithm considers block groups i mod(n), i+1 mod(n), i+1+2 mod(n), i+1+2+4 mod(n), etc.
2. If the logarithmic search failed in finding a block group with a free inode, the function performs an exhaustive linear search starting from the block group that includes the parent directory dir.
Invokes read_inode_bitmap( ) to get the inode bitmap of the selected block group and searches for the first null bit into it, thus obtaining the number of the first free disk inode.
Allocates the disk inode: sets the corresponding bit in the inode bitmap and marks the buffer containing the bitmap as dirty. Moreover, if the filesystem has been mounted specifying the MS_SYNCHRONOUS flag (see the section "Mounting a Generic Filesystem" in Chapter 12), the function invokes sync_dirty_buffer( ) to start the I/O write operation and waits until the operation terminates.
Decreases the bg_free_inodes_count field of the group descriptor. If the new inode is a directory, the function increases the bg_used_dirs_count field and marks the buffer containing the group descriptor as dirty.
Increases or decreases the group's counter in the s_debts array of the superblock, according to whether the inode refers to a regular file or a directory.
Decreases the s_freeinodes_counter field of the ext2_sb_info data structure; moreover, if the new inode is a directory, it increases the s_dirs_counter field in the ext2_sb_info data structure.
Sets the s_dirt flag of the superblock to 1, and marks the buffer that contains it to as dirty.
Sets the s_dirt field of the VFS's superblock object to 1.
Initializes the fields of the inode object. In particular, it sets the inode number i_no and copies the value of xtime.tv_sec into i_atime, i_mtime, and i_ctime. Also loads the i_block_group field in the ext2_inode_info structure with the block group index. Refer to Table 18-3 for the meaning of these fields.
Initializes the ACLs of the inode.
Inserts the new inode object into the hash table inode_hashtable and invokes mark_inode_dirty( ) to move the inode object into the superblock's dirty inode list (see the section "Inode Objects" in Chapter 12).
Invokes ext2_preread_inode( ) to read from disk the block containing the inode and to put the block in the page cache. This type of read-ahead is done because it is likely that a recently created inode will be written back soon.
Returns the address of the new inode object.

18.6.2. Deleting inodes

The ext2_free_inode( ) function deletes a disk inode, which is identified by an inode object whose address inode is passed as the parameter. The kernel should invoke the function after a series of cleanup operations involving internal data structures and the data in the file itself. It should come after the inode object has been removed from the inode hash table, after the last hard link referring to that inode has been deleted from the proper directory and after the file is truncated to 0 length to reclaim all its data blocks (see the section "Releasing a Data Block" later in this chapter). It performs the following actions:

Invokes clear_inode( ), which in turn executes the following operations:
1. Removes any dirty "indirect" buffer associated with the inode (see the later section "Data Blocks Addressing"); they are collected in the list headed at the private_list field of the address_space object inode->i_data (see the section "The address_space Object" in Chapter 15).
2. If the I_LOCK flag of the inode is set, some of the inode's buffers are involved in I/O data transfers; the function suspends the current process until these I/O data transfers terminate.
3. Invokes the clear_inode method of the superblock object, if defined; the Ext2 filesystem does not define it.
4. If the inode refers to a device file, it removes the inode object from the device's list of inodes; this list is rooted either in the list field of the cdev character device descriptor (see the section "Character Device Drivers" in Chapter 13) or in the bd_inodes field of the block_device block device descriptor (see the section "Block Devices" in Chapter 14).
5. Sets the state of the inode to I_CLEAR (the inode object contents are no longer meaningful).
Computes the index of the block group containing the disk inode from the inode number and the number of inodes in each block group.
Invokes read_inode_bitmap( ) to get the inode bitmap.
Increases the bg_free_inodes_count( ) field of the group descriptor. If the deleted inode is a directory, it decreases the bg_used_dirs_count field. Marks the buffer that contains the group descriptor as dirty.
If the deleted inode is a directory, it decreases the s_dirs_counter field in the ext2_sb_info data structure, sets the s_dirt flag of the superblock to 1, and marks the buffer that contains it as dirty.
Clears the bit corresponding to the disk inode in the inode bitmap and marks the buffer that contains the bitmap as dirty. Moreover, if the filesystem has been mounted with the MS_SYNCHRONIZE flag, it invokes sync_dirty_buffer( ) to wait until the write operation on the bitmap's buffer terminates.

18.6.3. Data Blocks Addressing

Each nonempty regular file consists of a group of data blocks . Such blocks may be referred to either by their relative position inside the file their file block numberor by their position inside the disk partitiontheir logical block number (see the section "Block Devices Handling" in Chapter 14).

Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:

Derive from the offset f the file block number the index of the block that contains the character at offset f.
Translate the file block number to the corresponding logical block number.

Because Unix files do not include any control characters, it is quite easy to derive the file block number containing the f th character of a file: simply take the quotient of f and the filesystem's block size and round down to the nearest integer.

For instance, let's assume a block size of 4 KB. If f is smaller than 4,096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4,096 and less than 8,192, the character is contained in the data block that has file block number 1, and so on.

This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, because the data blocks of an Ext2 file are not necessarily adjacent on disk.

The Ext2 filesystem must therefore provide a method to store the connection between each file block number and the corresponding logical block number on disk. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized blocks that contain extra pointers, which are an inode extension used to handle large files.

The i_block field in the disk inode is an array of EXT2_N_BLOCKS components that contain logical block numbers. In the following discussion, we assume that EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a larger data structure, which is illustrated in Figure 18-5. As can be seen in the figure, the 15 components of the array are of 4 different types:

The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the fileto the blocks that have file block numbers from 0 to 11.
The component at index 12 contains the logical block number of a block, called indirect block, that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to b/4+11, where b is the filesystem's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, and then look in that block for another pointer to the ultimate block that contains the file contents.
The component at index 13 contains the logical block number of an indirect block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers that correspond to the file block numbers ranging from b/4+12 to (b/4)²+(b/4)+11.
Finally, the component at index 14 uses triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from (b/4)²+(b/4)+12 to (b/4)³+(b/4)²+(b/4)+11.

Figure 18-5. Data structures used to address the file's data blocks

In Figure 18-5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way through indirect blocks to reach the block that contains the actual contents of the file.

Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, every data can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed to access the required block. In practice, this is a worst-case estimate, because dentry, inode, and page caches contribute significantly to reduce the number of real disk accesses.

Notice also how the block size of the filesystem affects the addressing mechanism, because a larger block size allows the Ext2 to store more logical block numbers inside a single block. Table 18-11 shows the upper limit placed on a file's size for each block size and each addressing mode. For instance, if the block size is 1,024 bytes and the file contains up to 268 kilobytes of data, the first 12 KB of a file can be accessed through direct mapping and the remaining 13-268 KB can be addressed through simple indirection. Files larger than 2 GB must be opened on 32-bit architectures by specifying the O_LARGEFILE opening flag.

Table 18-11. File-size upper limits for data block addressing
Block size
Direct
1-Indirect
2-Indirect
3-Indirect
1,024
12 KB
268 KB
64.26 MB
16.06 GB
2,048
24 KB
1.02 MB
513.02 MB
256.5 GB
4,096
48 KB
4.04 MB
4 GB
~ 4 TB

18.6.4. File Holes

A file hole is a portion of a regular file that contains null characters and is not stored in any data block on disk. Holes are a long-standing feature of Unix files. For instance, the following Unix command creates a file in which the first bytes are a hole:

$ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6

Now /tmp/hole has 6,145 characters (6,144 null characters plus an X character), yet the file occupies just one data block on disk.

File holes were introduced to avoid wasting disk space. They are used extensively by database applications and, more generally, by all applications that perform hashing on files.

The Ext2 implementation of file holes is based on dynamic data block allocation: a block is actually assigned to a file only when the process needs to write data into it. The i_size field of each inode defines the size of the file as seen by the program, including the holes, while the i_blocks field stores the number of data blocks effectively assigned to the file (in units of 512 bytes).

In our earlier example of the dd command, suppose the /tmp/hole file was created on an Ext2 partition that has blocks of size 4,096. The i_size field of the corresponding disk inode stores the number 6,145, while the i_blocks field stores the number 8 (because each 4,096-byte block includes eight 512-byte blocks). The second element of the i_block array (corresponding to the block having file block number 1) stores the logical block number of the allocated block, while all other elements in the array are null (see Figure 18-6).

Figure 18-6. A file with an initial hole

18.6.5. Allocating a Data Block

When the kernel has to locate a block holding data for an Ext2 regular file, it invokes the ext2_get_block( ) function. If the block does not exist, the function automatically allocates the block to the file. Remember that this function may be invoked every time the kernel issues a read or write operation on an Ext2 regular file (see the sections "Reading from a File" and "Writing to a File" in Chapter 16); clearly, this function is invoked only if the affected block is not included in the page cache.

The ext2_get_block( ) function handles the data structures already described in the section "Data Blocks Addressing," and when necessary, invokes the ext2_alloc_block( ) function to actually search for a free block in the Ext2 partition. If necessary, the function also allocates the blocks used for indirect addressing (see Figure 18-5).

To reduce file fragmentation, the Ext2 filesystem tries to get a new block for a file near the last block already allocated for the file. Failing that, the filesystem searches for a new block in the block group that includes the file's inode. As a last resort, the free block is taken from one of the other block groups.

The Ext2 filesystem uses preallocation of data blocks. The file does not get only the requested block, but rather a group of up to eight adjacent blocks. The i_prealloc_count field in the ext2_inode_info structure stores the number of data blocks preallocated to a file that are still unused, and the i_prealloc_block field stores the logical block number of the next preallocated block to be used. All preallocated blocks that remain unused are freed when the file is closed, when it is truncated, or when a write operation is not sequential with respect to the write operation that triggered the block preallocation.

The ext2_alloc_block( ) function receives as its parameters a pointer to an inode object, a goal , and the address of a variable that will store an error code. The goal is a logical block number that represents the preferred position of the new block. The ext2_get_block( ) function sets the goal parameter according to the following heuristic:

If the block that is being allocated and the previously allocated block have consecutive file block numbers, the goal is the logical block number of the previous block plus 1; it makes sense that consecutive blocks as seen by a program should be adjacent on disk.
If the first rule does not apply and at least one block has been previously allocated to the file, the goal is one of these blocks' logical block numbers. More precisely, it is the logical block number of the already allocated block that precedes the block to be allocated in the file.
If the preceding rules do not apply, the goal is the logical block number of the first block (not necessarily free) in the block group that contains the file's inode.

The ext2_alloc_block( ) function checks whether the goal refers to one of the preallocated blocks of the file. If so, it allocates the corresponding block and returns its logical block number; otherwise, the function discards all remaining preallocated blocks and invokes ext2_new_block( ).

This latter function searches for a free block inside the Ext2 partition with the following strategy:

If the preferred block passed to ext2_alloc_block( )the block that is the goalis free, the function allocates the block.
If the goal is busy, the function checks whether one of the next blocks after the preferred block is free.
If no free block is found in the near vicinity of the preferred block, the function considers all block groups, starting from the one including the goal. For each block group, the function does the following:
1. Looks for a group of at least eight adjacent free blocks.
2. If no such group is found, looks for a single free block.

The search ends as soon as a free block is found. Before terminating, the ext2_new_block( ) function also tries to preallocate up to eight free blocks adjacent to the free block found and sets the i_prealloc_block and i_prealloc_count fields of the disk inode to the proper block location and number of blocks.

18.6.6. Releasing a Data Block

When a process deletes a file or truncates it to 0 length, all its data blocks must be reclaimed. This is done by ext2_truncate( ), which receives the address of the file's inode object as its parameter. The function essentially scans the disk inode's i_block array to locate all data blocks and all blocks used for the indirect addressing. These blocks are then released by repeatedly invoking ext2_free_blocks( ).

The ext2_free_blocks( ) function releases a group of one or more adjacent data blocks. Besides its use by ext2_truncate( ), the function is invoked mainly when discarding the preallocated blocks of a file (see the earlier section "Allocating a Data Block"). Its parameters are:

inode: The address of the inode object that describes the file
block: The logical block number of the first block to be released
count: The number of adjacent blocks to be released

The function performs the following actions for each block to be released:

Gets the block bitmap of the block group that includes the block to be released
Clears the bit in the block bitmap that corresponds to the block to be released and marks the buffer that contains the bitmap as dirty.
Increases the bg_free_blocks_count field in the block group descriptor and marks the corresponding buffer as dirty.
Increases the s_free_blocks_count field of the disk superblock, marks the corresponding buffer as dirty, and sets the s_dirt flag of the superblock object.
If the filesystem has been mounted with the MS_SYNCHRONOUS flag set, it invokes sync_dirty_buffer( ) and waits until the write operation on the bitmap's buffer terminates.

Файловая система Ext3

In this section we'll briefly describe the enhanced filesystem that has evolved from Ext2, named Ext3. The new filesystem has been designed with two simple concepts in mind:

To be a journaling filesystem (see the next section)
To be, as much as possible, compatible with the old Ext2 filesystem

Ext3 achieves both the goals very well. In particular, it is largely based on Ext2, so its data structures on disk are essentially identical to those of an Ext2 filesystem. As a matter of fact, if an Ext3 filesystem has been cleanly unmounted, it can be remounted as an Ext2 filesystem; conversely, creating a journal of an Ext2 filesystem and remounting it as an Ext3 filesystem is a simple, fast operation.

Thanks to the compatibility between Ext3 and Ext2, most descriptions in the previous sections of this chapter apply to Ext3 as well. Therefore, in this section, we focus on the new feature offered by Ext3 "the journal."

18.7.1. Journaling Filesystems

As disks became larger, one design choice of traditional Unix filesystems (such as Ext2) turns out to be inappropriate. As we know from Chapter 14, updates to filesystem blocks might be kept in dynamic memory for long period of time before being flushed to disk. A dramatic event such as a power-down failure or a system crash might thus leave the filesystem in an inconsistent state. To overcome this problem, each traditional Unix filesystem is checked before being mounted; if it has not been properly unmounted, then a specific program executes an exhaustive, time-consuming check and fixes all the filesystem's data structures on disk.

For instance, the Ext2 filesystem status is stored in the s_mount_state field of the superblock on disk. The e2fsck utility program is invoked by the boot script to check the value stored in this field; if it is not equal to EXT2_VALID_FS, the filesystem was not properly unmounted, and therefore e2fsck starts checking all disk data structures of the filesystem.

Clearly, the time spent checking the consistency of a filesystem depends mainly on the number of files and directories to be examined; therefore, it also depends on the disk size. Nowadays, with filesystems reaching hundreds of gigabytes, a single consistency check may take hours. The involved downtime is unacceptable for every production environment or high-availability server.

The goal of a journaling filesystem is to avoid running time-consuming consistency checks on the whole filesystem by looking instead in a special disk area that contains the most recent disk write operations named journal. Remounting a journaling filesystem after a system failure is a matter of a few seconds.

18.7.2. The Ext3 Journaling Filesystem

The idea behind Ext3 journaling is to perform each high-level change to the filesystem in two steps. First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.

While recovering after a system failure, the e2fsck program distinguishes the following two cases:

The system failure occurred before a commit to the journal: Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, e2fsck ignores them.
The system failure occurred after a commit to the journal: The copies of the blocks are valid, and e2fsck writes them into the filesystem.

In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing every inconsistency due to unfinished I/O data transfers into the filesystem.

Don't expect too much from a journaling filesystem; it ensures consistency only at the system call level. For instance, a system failure that occurs while you are copying a large file by issuing several write( ) system calls will interrupt the copy operation, thus the duplicated file will be shorter than the original one.

Furthermore, journaling filesystems do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, group block descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.

Several journaling filesystems, such as SGI's XFS and IBM's JFS , limit themselves to logging the operations affecting metadata. In fact, metadata's log records are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.

The Ext3 filesystem, however, can be configured to log the operations affecting both the filesystem metadata and the data blocks of the files. Because logging every kind of write operation leads to a significant performance penalty, Ext3 lets the system administrator decide what has to be logged; in particular, it offers three different journaling modes :

Journal: All filesystem data and metadata changes are logged into the journal. This mode minimizes the chance of losing the updates made to each file, but it requires many additional disk accesses. For example, when a new file is created, all its data blocks must be duplicated as log records. This is the safest and slowest Ext3 journaling mode.
Ordered: Only changes to filesystem metadata are logged into the journal. However, the Ext3 filesystem groups metadata and relative data blocks so that data blocks are written to disk before the metadata. This way, the chance to have data corruption inside the files is reduced; for instance, each write access that enlarges a file is guaranteed to be fully protected by the journal. This is the default Ext3 journaling mode.
Writeback: Only changes to filesystem metadata are logged; this is the method found on the other journaling filesystems and is the fastest mode.

The journaling mode of the Ext3 filesystem is specified by an option of the mount system command. For instance, to mount an Ext3 filesystem stored in the /dev/sda2 partition on the /jdisk mount point with the "writeback" mode, the system administrator can type the command:

# mount -t ext3 -o data=writeback /dev/sda2 /jdisk

18.7.3. The Journaling Block Device Layer

The Ext3 journal is usually stored in a hidden file named .journal located in the root directory of the filesystem.

The Ext3 filesystem does not handle the journal on its own; rather, it uses a general kernel layer named Journaling Block Device, or JBD. Right now, only Ext3 uses the JBD layer, but other filesystems might use it in the future.

The JBD layer is a rather complex piece of software. The Ext3 filesystem invokes the JBD routines to ensure that its subsequent operations don't corrupt the disk data structures in case of system failure. However, JBD typically uses the same disk to log the changes performed by the Ext3 filesystem, and it is therefore vulnerable to system failures as much as Ext3. In other words, JBD must also protect itself from system failures that could corrupt the journal.

Therefore, the interaction between Ext3 and JBD is essentially based on three fundamental units:

Log record: Describes a single update of a disk block of the journaling filesystem.
Atomic operation handle: Includes log records relative to a single high-level change of the filesystem; typically, each system call modifying the filesystem gives rise to a single atomic operation handle.
Transaction: Includes several atomic operation handles whose log records are marked valid for e2fsck at the same time.

18.7.3.1. Log records

A log record is essentially the description of a low-level operation that is going to be issued by the filesystem. In some journaling filesystems, the log record consists of exactly the span of bytes modified by the operation, together with the starting position of the bytes inside the filesystem. The JBD layer, however, uses log records consisting of the whole buffer modified by the low-level operation. This approach may waste a lot of journal space (for instance, when the low-level operation just changes the value of a bit in a bitmap), but it is also much faster because the JBD layer can work directly with buffers and their buffer heads.

Log records are thus represented inside the journal as normal blocks of data (or metadata). Each such block, however, is associated with a small tag of type journal_block_tag_t, which stores the logical block number of the block inside the filesystem and a few status flags.

Later, whenever a buffer is being considered by the JBD, either because it belongs to a log record or because it is a data block that should be flushed to disk before the corresponding metadata block (in the "ordered" journaling mode), the kernel attaches a journal_head data structure to the buffer head. In this case, the b_private field of the buffer head stores the address of the journal_head data structure and the BH_JBD flag is set (see the section "Block Buffers and Buffer Heads" in Chapter 15).

18.7.3.2. Atomic operation handles

Every system call modifying the filesystem is usually split into a series of low-level operations that manipulate disk data structures.

For instance, suppose that Ext3 must satisfy a user request to append a block of data to a regular file. The filesystem layer must determine the last block of the file, locate a free block in the filesystem, update the data block bitmap inside the proper block group, store the logical number of the new block either in the file's inode or in an indirect addressing block, write the contents of the new block, and finally, update several fields of the inode. As you see, the append operation translates into many lower-level operations on the data and metadata blocks of the filesystem.

Now, just imagine what could happen if a system failure occurred in the middle of an append operation, when some of the lower-level manipulations have already been executed while others have not. Of course, the scenario could be even worse, with high-level operations affecting two or more files (for example, moving a file from one directory to another).

To prevent data corruption, the Ext3 filesystem must ensure that each system call is handled in an atomic way. An atomic operation handle is a set of low-level operations on the disk data structures that correspond to a single high-level operation. When recovering from a system failure, the filesystem ensures that either the whole high-level operation is applied or none of its low-level operations is.

Each atomic operation handle is represented by a descriptor of type handle_t. To start an atomic operation, the Ext3 filesystem invokes the journal_start( ) JBD function, which allocates, if necessary, a new atomic operation handle and inserts it into the current transactions (see the next section). Because every low-level operation on the disk might suspend the process, the address of the active handle is stored in the journal_info field of the process descriptor. To notify that an atomic operation is completed, the Ext3 filesystem invokes the journal_stop( ) function.

18.7.3.3. Transactions

For reasons of efficiency, the JBD layer manages the journal by grouping the log records that belong to several atomic operation handles into a single transaction. Furthermore, all log records relative to a handle must be included in the same transaction.

All log records of a transaction are stored in consecutive blocks of the journal. The JBD layer handles each transaction as a whole. For instance, it reclaims the blocks used by a transaction only after all data included in its log records is committed to the filesystem.

As soon as it is created, a transaction may accept log records of new handles. The transaction stops accepting new handles when either of the following occurs:

A fixed amount of time has elapsed, typically 5 seconds.
There are no free blocks in the journal left for a new handle.

A transaction is represented by a descriptor of type TRansaction_t. The most important field is t_state, which describes the current status of the transaction.

Essentially, a transaction can be:

Complete

All log records included in the transaction have been physically written onto the journal. When recovering from a system failure, e2fsck considers every complete transaction of the journal and writes the corresponding blocks into the filesystem. In this case, the t_state field stores the value T_FINISHED.

Incomplete

At least one log record included in the transaction has not yet been physically written to the journal, or new log records are still being added to the transaction. In case of system failure, the image of the transaction stored in the journal is likely not up-to-date. Therefore, when recovering from a system failure, e2fsck does not trust the incomplete transactions in the journal and skips them. In this case, the t_state field stores one of the following values:

T_RUNNING: Still accepting new atomic operation handles.
T_LOCKED: Not accepting new atomic operation handles, but some of them are still unfinished.
T_FLUSH: All atomic operation handles have finished, but some log records are still being written to the journal.
T_COMMIT: All log records of the atomic operation handles have been written to disk, but the transaction has yet to be marked as completed on the journal.

At any time the journal may include several transactions, but only one of them is in the T_RUNNING state it is the active transaction that is accepting the new atomic operation handle requests issued by the Ext3 filesystem.

Several transactions in the journal might be incomplete, because the buffers containing the relative log records have not yet been written to the journal.

If a transaction is complete, all its log records have been written to the journal but some of the corresponding buffers have yet to be written onto the filesystem. A complete transaction is deleted from the journal when the JBD layer verifies that all buffers described by the log records have been successfully written onto the Ext3 filesystem.

18.7.4. How Journaling Works

Let's try to explain how journaling works with an example: the Ext3 filesystem layer receives a request to write some data blocks of a regular file.

As you might easily guess, we are not going to describe in detail every single operation of the Ext3 filesystem layer and of the JBD layer. There would be far too many issues to be covered! However, we describe the essential actions:

The service routine of the write( ) system call triggers the write method of the file object associated with the Ext3 regular file. For Ext3, this method is implemented by the generic_file_write( ) function, already described in the section "Writing to a File" in Chapter 16.
The generic_file_write( ) function invokes the prepare_write method of the address_space object several times, once for every page of data involved by the write operation. For Ext3, this method is implemented by the ext3_prepare_write( ) function.
The ext3_prepare_write( ) function starts a new atomic operation by invoking the journal_start( ) JBD function. The handle is added to the active transaction. Actually, the atomic operation handle is created only when executing the first invocation of the journal_start( ) function. Following invocations verify that the journal_info field of the process descriptor is already set and use the referenced handle.
The ext3_prepare_write( ) function invokes the block_prepare_write( ) function already described in Chapter 16, passing to it the address of the ext3_get_block( ) function. Remember that block_prepare_write( ) takes care of preparing the buffers and the buffer heads of the file's page.
When the kernel must determine the logical number of a block of the Ext3 filesystem, it executes the ext3_get_block( ) function. This function is actually similar to ext2_get_block( ), which is described in the earlier section "Allocating a Data Block." A crucial difference, however, is that the Ext3 filesystem invokes functions of the JBD layer to ensure that the low-level operations are logged:
- Before issuing a low-level write operation on a metadata block of the filesystem, the function invokes journal_get_write_access( ). Basically, this latter function adds the metadata buffer to a list of the active transaction. However, it must also check whether the metadata is included in an older incomplete transaction of the journal; in this case, it duplicates the buffer to make sure that the older transactions are committed with the old content.
- After updating the buffer containing the metadata block, the Ext3 filesystem invokes journal_dirty_metadata( ) to move the metadata buffer to the proper dirty list of the active transaction and to log the operation in the journal.
Notice that metadata buffers handled by the JBD layer are not usually included in the dirty lists of buffers of the inode, so they are not written to disk by the normal disk cache flushing mechanisms described in Chapter 15.
If the Ext3 filesystem has been mounted in "journal" mode, the ext3_prepare_write( ) function also invokes journal_get_write_access( ) on every buffer touched by the write operation.
Control returns to the generic_file_write( ) function, which updates the page with the data stored in the User Mode address space and then invokes the commit_write method of the address_space object. For Ext3, the function that implements this method depends on how the Ext3 filesystem has been mounted:
- If the Ext3 filesystem has been mounted in "journal" mode, the commit_write method is implemented by the ext3_journalled_commit_write( ) function, which invokes journal_dirty_metadata( ) on every buffer of data (not metadata) in the page. This way, the buffer is included in the proper dirty list of the active transaction and not in the dirty list of the owner inode; moreover, the corresponding log records are written to the journal. Finally, ext3_journalled_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
- If the Ext3 filesystem has been mounted in "ordered" mode, the commit_write method is implemented by the ext3_ordered_commit_write( ) function, which invokes the journal_dirty_data( ) function on every buffer of data in the page to insert the buffer in a proper list of the active transactions. The JBD layer ensures that all buffers in this list are written to disk before the metadata buffers of the transaction. No log record is written onto the journal. Next, ext3_ordered_commit_write( ) executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode. Finally, ext3_ordered_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
- If the Ext3 filesystem has been mounted in "writeback" mode, the commit_write method is implemented by the ext3_writeback_commit_write( ) function, which executes the normal generic_commit_write( ) function described in Chapter 15, which inserts the data buffers in the list of the dirty buffers of the owner inode. Then, ext3_writeback_commit_write( ) invokes journal_stop( ) to notify the JBD layer that the atomic operation handle is closed.
The service routine of the write( ) system call terminates here. However, the JBD layer has not finished its work. Eventually, our transaction becomes complete when all its log records have been physically written to the journal. Then journal_commit_transaction( ) is executed.
If the Ext3 filesystem has been mounted in "ordered" mode, the journal_commit_transaction( ) function activates the I/O data transfers for all data buffers included in the list of the transaction and waits until all data transfers terminate.
The journal_commit_transaction( ) function activates the I/O data transfers for all metadata buffers included in the transaction (and also for all data buffers, if Ext3 was mounted in "journal" mode).
Periodically, the kernel activates a checkpoint activity for every complete transaction in the journal. The checkpoint basically involves verifying whether the I/O data transfers triggered by journal_commit_transaction( ) have successfully terminated. If so, the transaction can be deleted from the journal.

Of course, the log records in the journal never play an active role until a system failure occurs. Only during system reboot does the e2fsck utility program scan the journal stored in the filesystem and reschedule all write operations described by the log records of the complete transactions.

Оставьте свой комментарий !

Ваше имя:

Комментарий:

Оба поля являются обязательными

Автор	Комментарий к данной статье