Serg Iakovlev

or:

and:

LINUX

Language

Kernel

Package

Book

Test

Forum

iakovlev.org

Модульная структура

 
 Большинство юниксовых ядер монолитны. 
 При таком подходе вся функциональность операционной системы 
 находится в одном большом куске кода , который выполняется в единственном процессе 
 и в едином адрессном пространстве.
 Все компоненты ядра имеют полный доступ к внутренним стуктурам данных ядра и процедурам.
 Если в ядре происходят какие-то изменения , все модули и процедуры должны быть перелинкованы 
 и система перезагружена для того , чтобы это возимело успех.
 Как результат , любая модификация , такая как добавление нового драйвера или функции файловой системы,
 проблематична. 
 Эта проблема актуальна для линукса , операционной системы , которая разрабатывается группой 
 независимых программистов. Хотя линукс не использует microkernel-подход напрямую , 
 это все-же находит применение в его модульной архитектуре.
 В линуксе имеется набор модулей , которые могут быть загружены-выгружены в произвольный момент.
 Модуль - это обьектный файл , который может быть прилинкован-отлинкован к ядру в ран-тайме.
 В модулях реализованы специфические функции , такие , как файловая система , device driver, 
 или некоторые фичи user space .
 Модуль не выполняется как собственный тред или процесс ,хотя это возможно реализовать.
 Модуль выполняется в пространстве основного процесса самого ядра.
 Это обуславливает некоторую трудность и специфику их реализации. 
 Загружаемые модули имеют 2 важнейших характеристики :
 1. динамическая линковка : модуль может быть загружен и прилинкован к ядру , который УЖЕ находится в памяти.
 Модуль может быть отлинкован также в произвольный момент
 2. Модульная иерархия : модули имеют свою иерархию. Они могут ссылаться на другие модули 
 либо служить источником ссылки от другого модуля.
 Динамическая линковка упрощает конфигурирование и улучшает надежность ядра.
 Пользовательская программа может загрузить-выгрузить ядро с помощью соманд :  insmod и rmmod . 
 Ядро само по мере необходимости может подгружать отдельные модули.
 Между модулями устанавливаются зависимости в иерархии , что имеет следующие преимущества :
 
 	1. Несколько модулей  (например драйвера для железа) 
 		могут быть легко обьединены в один модуль. 
        
 	2. Ядро самостоятельно определяет , нужно ли подгружать какие-то дополнительные модули
 		при загрузке какого-то конкретного модуля , а также решает , может ли быть выгруженным
 		какой-либо модуль , если на него ссылаются другие модули.	
 

 
 Например , пусть нам нужно загрузить 2 модуля : FAT и VFAT. 
 Имеются 2 таблицы : таблица модулей и таблица символов.
 Таблица модулей включает следующие элементы :
          *next: указатель на следующий модуль. Все модули организованы в связанный список. 
                 Список начинается с псевдо-модуля.  
          *name: указатель на имя модуля.  
          size: размер модуля в страницах памяти.  
          usecount: модули используют счетчики. Счетчик -  counter - увеличивается на единицу при старте
                    и уменьшается при выгрузке. 
          flags: флаги.  
          nsyms: число экспортируемых символов.  
          ndeps: число зависимых модулей
          *syms: указатель на символическую таблицу модуля.  
          *deps: указатель на спиок модулей , которые ссылаются (зависят) на данный модуль
          *refs: указатель на список модулей , которые используют данный модкль
 Символическая таблица определяет символы , используемые данным модулем.
 Например , модуль VFAT загружается после модуля FATmodule и что VFAT зависим от FAT.
 
 Компоненты ядра
 Ядро представляет из себя набор компонентов .
          Сигналы: это инструмент или механизм управления процессами. 
                 Например , есть сигналы для уведомления процессы о возникновении исключения ,
 				такого как деление на ноль.
          Системные вызовы: механизм , с помощью которого процесс запрашивает какой-то ресурс
 				или функцию у самого ядра.
 				Системные вызовы можно разделить на 6 категорий :
                 			1 файловая система 
 					2 процессы 
 					3 планировщик
 					4 меж-процессное взаимодействие
 					5 сокеты (сеть)
 					6 все остальное 
          Процессы и планировщик  : создание , управление и диспетчеризация процессами.  
          Виртуальная память :выделяет и управляет виртуальной памятью процессов.  
          Файловая система : обеспечивает глобальный иерархический контроль для файлов , директорий, a
 				а также других типов файлов .
          Сетевые протоколы : поддерживает интерфейс сокетов для TCP/IP протокола.  
          Символьные устройства : к ним относятся терминалы , модемы , принтеры.
          Блочные устройства : диски , сиди-ромы и т.д.
          Сетевые устройства : управляет сетевыми картами и портами 
          Исключения : могут генерироваться со стороны CPU, например memory fault.  
          Физическая память : управляет пулом страниц реальной памяти и выделяет страницы для виртуальной памяти.  
          Прерывания : возникают у периферийных устройств.
 
 
 Некоторые примеры системных вызовов (некоторые уже устарели) :  
 
         close - закрывает файловый дескриптор.
         link  - создвет новое имя файла. 
         open  - открывает/создает файл. 
         read  - читает файл. 
         exit  - останавливает процесс. 
         getpid -  получает id процесса
         setuid - устанавливает id пользователя процессу. 
         prtrace - процесс , который контролирует выполнение другого процесса ,
                   показывает содержимое имиджа процесса и регистров . 
         sched_getparam - устанавливает параметры планировщика для процесса по его pid. 
         sched_get_priority_max  - возвращает максимальный приоритет планировщика
         sched_setscheduler - устанавливает scheduling policy (или FIFO) 
         sched_rr_get_interval -  пишет в структуру timespec параметр tp - т.н. round robin time quantum 
 								 для процесса по его pid. 
         sched_yield - процесс помещается в конец очереди
         msgrcv  - помещает в буфер новое сообщение , полученного из очереди
         semctl  - выполнение контрольной операции с использованием семафора и semid. 
         semop   - выполнение операции для выбранного семафора
         shmat   - выделяет расшаренный сегмент памяти по shmid для вызываемого процесса. 
         shmctl  - выдает информацию о расшаренной памяти , ее владельце , группе , правах и т.д.
         bind    - назначает сокету локальный IP-адрес , возвращает 0 в случае успеха и 1 при ошибке. 
         connect - устанавливает коннект между сокетами с помощью sockaddr. 
         gethostname - возвращает локальное имя хоста
         send    -  отсылает байты для сокета. 
         setsockopt - устанавливает опции сокета
         create_module -  создание обьекта - загружаемый модуль - и резервирование памяти ядра
         fsync - копирование файла из памяти на диск 
         query_module - запрос информации у ядра о загружаемом модуле. 
         time - возвращает время в секундах начиная с January 1, 1970. 
                             
 
    Процессы в Linux
 Процесс , или задача , представлен в линуксе структурой task_struct. 
 Она хранит в себе следующую информацию :
  State: состояние или статус выполнения процесса (выполнение,чтение,спячка,остановка,зомби). 
  Информация для планировщика: необходима для управления процессами. 
                          Процесс имеет приоритет. 
                          Real-time процесс имеет преимущества перед т.н. нормальными процессами. 
                          counter - количество времени , отводимое процессу на выполнение.  
  Идентификаторы: каждый процесс имеет уникальный идентификатор,а также пользовательский и групповой id . 
               Групповой id используется для назначения прав группе процессов.  
  Межпроцессное взаимодействие : Linux поддерживает IPC-механизм  UNIX SVR4.  
  Линки: каждый процесс включает линк на его родительский процесс, 
  линки - siblings - на другие процессы с тем же родителем,
         а также линки на всех его детей.  
  Times и timers:  время создания процесса . 
                    процесс может включать в себя таймеры , которые привязаны к системному вызову; 
                    когда таймер истекает , процессу посылается сигнал . 
                    Таймер может быть однократный либо периодический.  
  Файловая система : указатели на файлы , открытые данным процессом, а также указатели 
               на собственный исполняемый файл и текущий каталог .  
  Адресное пространство : виртуальное адресное пространство процесса.
  Контекст процесса:  регистровая и стековая информация.
  Статус выполнения : 2 состояния - выполняемый либо готовый к исполнению 
  Прерывание : состояние,в котором находится процесс,ожидающий какого-то события, 
                 как конец I/O операции, или сигнал от другого процесса .  
  Uninterruptible: другое блокировочное состояние процесса. Процесс ждет сигнала от железа 
 				и не реагирует более ни на что.
  Stopped: остановленный процесс , например в процессе дебага
  Зомби: процесс убит , а его task structure до сих пор болтается в таблице процессов
 
  Linux Threads
  Классический UNIX традиционно поддерживает один тред для одного процесса , 
 в то время как новейшие версии UNIX обеспечивают поддержку  множества тредов на процесс. 
 В старых версиях Linux нет поддержки  multithreading.
 Сейчас между потоками и процессами практически стирается грань.
 Треды для одного процесса маппятся в процессы с тем же самым ID группы. 
 Это позволяет им расшаривать такие ресурсы , как файлы и память и позволяет избежать
 переключения контекста при переключении планировщиком между тредами одной группы. 
 Новый процесс в линуксе создается путем копирования атрибутов текущего процесса.
 Он может быть клонирован так , что он расшаривает свои ресурсы.
 Когда 2 процесса расшаривают виртуальную память. они ведут себя как треды внутри одного процесса.
 Вместо системного вызова fork(),используется clone(),который имеет определенный набор флагов. 
 При переключении между процессами ядро проверяет их адреса страниц памяти.
 Если они совпадают , переключение контекста вырождается в простой jmp из одной области кода в другую. 
 В отличие от памяти , такие процессы не могут расшаривать стек.
    Флаги системного вызова clone ():
 CLONE_CLEARID           очищает task ID.
 CLONE_FILES CLONE_FS    расшаривает таблицу открытых файлов, а также рутовый каталог и текущий
 CLONE_IDLETASK          устанавливает PID в ноль, который ссылается на idle task. 
                         idle task используется тогда , когда текущая задача занята ресурсами. 
 
                         
 LINUX KERNEL CONCURRENCY MECHANISMS
   Linux включает в себя такие механизмы , заимствованные из  UNIX, 
 как pipes, messages, shared memory, signals. 
 В Linux 2.6 имеются дополнительные механизмы для выполнения тредов в режиме kernel mode. 
 		Atomic Operations
  В линуксе есть набор операций , которые гарантируют атомарность.
 Они выолняются независимо от всяких прерываний и наложений.
 На однопроцессорных системах тред гарантированно выполняет такую атомарную операцию
 от начала и до конца непрерывно.
 На мультипроцессорных системах такая операция будет блокирована от доступа другими тредами.
 В линуксе 2 типа атомарных операций : 
 целочисленные операции, которые оперируют с целочисленными переменными, 
 и битмап-операции, которые оперируют битами. 
 Для различных архитектур реализация на ассемблере может отличаться для таких операций.
 В некоторых архитектурах операция , которая блокирует шину памяти ,
 гарантированно становится атомарной.
          Atomic Integer Operations
 ATOMIC_INIT (int i) 
 int atomic_read(atomic_t *v) 
 void atomic_set(atomic_t *v, int i) 
 void atomic_add(int i, atomic_t *v) 
 void atomic_sub(int i, atomic_t *v) 
 void atomic_inc(atomic_t *v) 
 void atomic_dec(atomic_t *v) 
 int atomic_sub_and_test(int i, atomic_t *v) 
 int atomic_add_negative(int i, atomic_t *v)
 int atomic_dec_and_test(atomic_t *v) 
 int atomic_inc_and_test(atomic_t *v)
         Atomic Bitmap Operations
 void set_bit(int nr, void *addr) 
 void clear_bit(int nr, void *addr) 
 void change_bit(int nr, void *addr) 
 int test_and_set_bit(int nr, void *addr) 
 int test_and_clear_bit(int nr, void *addr) 
 int test_and_change_bit(int nr, void *addr) 
 int test_bit(int nr, void *addr)
 
 Для integer operations используется специальный тип -  atomic_t. 
 Они могут использовать только целочисленные переменные.
 В этом есть следующие преимущества :
 1. Атомарные операции используют только специальные защищенные переменные
 2. Эти переменные защищены от доступа со стороны неатомарных операций. 
 3. Компилятор не в состоянии оптимизировать атомарную операцию. 
 4. Этот тип данных служит для сокрытия архитектурных особенностей реализации.
 
 Пример использования атомарных операций - реализация счетчиков. 
 Битовая атомарная операций работает с набором бит в памяти.
 Атомарная операция является простейшим методом синхронизации внутри ядра.
 На их основе можно строить уже более сложные механизмы блокировки.
         Spinlocks
 Спинлок - техника , используемая в линуксе для защиты критических секций. 
 В данный момент спинлок может быть только у одного треда. 
 Другой тред будет только пытаться получить блокировку.
 Спинлоку выделяется место в памяти.
 Если спинлок=0,тред устанавливает его в 1 и входит в критическую секцию.
 Если значение не равно нулю , тред будет ожидать этого события.
 Спинлок прост в реализации , особенность которой заключается в том ,
 что блокирующий тред будет находиться в состоянии ожидания.
 Спинлок хорош тогда , когда время ожидания весьма мало , например оно равно 
 двум переключениям контекста.
 Базовая форма использования спинлока следующая :
         spin_lock(&lock) 
         /* critical section */ 
         spin_unlock(&lock) 
 
         Basic Spinlocks 
   Типы спинлоков :
 Plain:          В критической секции не используются прерывания или они задисэйблены
 				на время выполнения критической секции
                 это не распространяется на прерывания процессора.  
 _irq:           Прерывания разрешены
 _irqsave:       состояния процессора в момент прерывания сохраняется ,
 				а затем после разблокирования восстанавливается
 _bh:            используется для отмены и последующего разрешения прерываний 
                 для избежания конфликта в защищенных критичеких секциях.
 
 plain spinlock используются тогда , когда прерыванию закрыт доступ к защищенным данным.
 Если  kernel preemption выключено, тред , выполняемый в пространстве ядра , не может быть прерван.
 Если kernel preemption включено, прерывания разрешены, и спинлоки работают 
 в режиме включения/выключения прерываний. 
 Спинлоки позволяют использовать их независимо от того , одно - или много-процессорная машина.
 
 
         Семафоры
 На пользовательском уровне линукс предоставляет интерфейс семафоров , соответствующий UNIX SVR4. 
 В линуксе этот интерфейс реализован по-своему . 
 Код ядра имеет доступ к т.н. ядерным семафорам.
 Пользовательские программы не могут получить к ним доступ , даже через системные вызовы.
 В линуксе есть 3 типа таких семафоров : 
         бинарные 
         счетчики
         чтение-запись
                 
         Традиционные семафоры
 void sema_init(struct semaphore *sem, int count) 
 void init_MUTEX(struct semaphore *sem) 
 void init_MUTEX_LOCKED(struct semaphore *sem) 
 void down(struct semaphore *sem) 
 int down_interruptible(struct semaphore *sem)
 int down_trylock(struct semaphore *sem)
 void up(struct semaphore *sem)
 
         Семафоры на чтение запись
 void init_rwsem(struct rw_semaphore, *rwsem) 
 void down_read(struct rw_semaphore, *rwsem) 
 void up_read(struct rw_semaphore, *rwsem) 
 void down_write(struct rw_semaphore, *rwsem) 
 void up_write(struct rw_semaphore, *rwsem)
 
 Инициализация счетчиков происходит с помощью sema_init function, 
 которая дает семафору имя и ему начальное значение.
 Бинарные семафоры , еще называемые мьютексами , инициализируются с помощью init_MUTEX и init_MUTEX_LOCKED, 
 которые присваивают значение семафору  1 и 0, соответственно. 
 В линуксе есть 3 типа операций с семафорами (semWait).
         1. Функция down является примером операции semWait. 
            Тред проверяет семафор иблокируется до тех пор . пока семафор не станет доступен. 
            Это состояние будет длиться до тех пор , пока семафор не будет разблокирован. 
            Операция применима для счетчиков и бинарных семафоров. 
         2. Функция down_interruptible позволяет треду получить и послать сигнал треду в состоянии блокировки.
            При этом функция down_interruptible увеличивает значение семафора и возвращает ошибку ,
            известную в линуксе как  -EINTR. 
            Эта фича полезна для драйверов устройств.
         3. Функция down_trylock пытается получить семафор без блокировки
            Если семафор доступен , он блокируется. 
 
         Семафоры на чтение-запись
 Такие псемафоры делят пользователей на читателей и писателей; при этом четов может быть много ,
 но писателей только один . 
 Такие семафоры используют непрерывный цикл. 
 
         Barriers
 В некоторых архитектурах,компиляторы могут себе позволить поменять порядок доступа к памяти.
 Это делается для оптимизации конвейера команд процессора.
 При этом проводится проверка , что зависимости в данных не нарушаются.
 Например , в коде :
         a = 1; 
         b = 1; 
 две команды без проблем можно поменять местами
 И вто же время , в коде : 
         a = 1; 
         b = a; 
 этого уже не сделаешь.
  Для изменения порядка , в линуксе используется т.н  memory barrier facility. 
 Операция rmb() гаранттровано блокирует чтение в коде. 
 Операция wmb() гарантировано блокирует запись. 
         1. barriers относятся к машинным инструкциям , называемым loads and stores. 
 		   Инструкция вида a = b читает из b и пишет в a. 
         2. Операции rmb, wmb, mb диктуют поведение как компилятору , так и процессору. 
            Компилятору будет предписано ничего не менять в процессе компиляции.
 
 Linux Memory Barrier Operations
 rmb() 
 wmb() 
 mb() 
 barrier() 
 smp_rmb() 
 smp_wmb() 
 smp_mb()

Конфликты / блокировка

Синхронизация необходима тогда , когда возникает т.н. concurrency , или конфликт : Concurrency возникает например тогда , когда 2 процесса пытаются поиметь один ресурс.

Concurrency can occur on uniprocessor (UP) hosts where multiple threads share the same CPU and preemption creates race conditions. Preemption is sharing the CPU transparently by temporarily pausing one thread to allow another to execute. A race condition occurs when two or more threads manipulate a shared data item and the result depends upon timing of the execution. Concurrency also exists in multiprocessor (MP) machines, where threads executing simultaneously in each processor share the same data. Note that in the MP case there is true parallelism because the threads execute simultaneously. In the UP case, parallelism is created by preemption. The difficulties of concurrency exist in both modes.

The Linux kernel supports concurrency in both modes. The kernel itself is dynamic, and race conditions can be created in a number of ways. The Linux kernel also supports multiprocessing known as symmetric multiprocessing (SMP). You can learn more about SMP in the Resources section later in this article.

To combat the issue of race conditions, the concept of a critical section was created. A critical section is a portion of code that is protected against multiple access. This portion of code can manipulate shared data or a shared service (such as a hardware peripheral). Critical sections operate on the principle of mutual exclusion (when a thread is within a critical section, all other threads are excluded from entering).

A problem created within critical sections is the condition of deadlock. Consider two separate critical sections, each protecting a different resource. For each resource, a lock is available, called A and B in this example. Now consider two threads that require access to the resources. Thread X takes lock A, and thread Y takes lock B. While those locks are held, each thread attempts to take the other lock that is currently held by the other thread (thread X wants lock B, and thread Y wants lock A). The threads are now deadlocked because they each hold one resource and want the other. A simple solution is to always take locks in the same order, which allows a thread to complete. Other solutions involve detecting this situation. Table 1 defines important concurrency terms covered here.

Table 1. Компоненты concurrency

Term	Definition
Race condition	Situation where simultaneous manipulation of a resource by two or more threads causes inconsistent results.
Critical section	Segment of code that coordinates access to a shared resource.
Mutual exclusion	Property of software that ensures exclusive access to a shared resource.
Deadlock	Special condition created by two or more processes and two or more resource locks that keep processes from doing productive work.

Now that you have a little theory under your belt and an understanding of the problem to be solved, let's look at the various ways that Linux supports concurrency and mutual exclusion. In the early days, mutual exclusion was provided by disabling interrupts, but this form of locking is inefficient (even though you can still find traces of it in the kernel). This method also doesn't scale well and doesn't guarantee mutual exclusion on other processors.

In the following review of locking mechanisms, we first look at the atomic operators, which provide protection for simple variables (counters and bitmasks). Simple spinlocks and reader/writer spinlocks are then covered as an efficient busy-wait lock for SMP architectures. Finally, we explore kernel mutexes, which are built on top of the atomic API.

The simplest means of synchronization in the Linux kernel are the atomic operations. Atomic means that the critical section is contained within the API function. No locking is necessary because it's inherent in the call. As C can't guarantee atomic operations, Linux relies on the underlying architecture to provide this. Since architectures differ greatly, you'll find varying implementations of the atomic functions. Some are provided almost entirely in assembly, while others resort to C and disabling interrupts using local_irq_save and local_irq_restore.

Устаревшие методы блокировки
A bad way to implement locking in the kernel is by disabling hard interrupts for the local CPU. These functions are available and are used (sometimes by the atomic operators) but are not recommended. The local_irq_save routine disables interrupts, and local_irq_restore restores the previously-enabled interrupts. These routines are reentrant, meaning they may be called within context of one another.

The atomic operators are ideal for situations where the data you need to protect is simple, such as a counter. While simple, the atomic API provides a number of operators for a variety of situations. Here's a sample use of the API.

To declare an atomic variable, you simply declare a variable of type atomic_t. This structure contains a single int element. Next, you ensure that your atomic variable is initialized using the ATOMIC_INIT symbolic constant. In the case shown in Listing 1, the atomic counter is set to zero. It's also possible to initialize the atomic variable at runtime using the atomic_set function.

Listing 1. Создание и инициализация атомарной переменной

                
 atomic_t my_counter ATOMIC_INIT(0);
 
 ... or ...
 
 atomic_set( &my_counter, 0 );

The atomic API supports a rich set of functions covering many use cases. You can read the contents of an atomic variable with atomic_read and also add a specific value to a variable with atomic_add. The most common operation is to simply increment the variable, which is provided with atomic_inc. The subtraction operators are also available, providing the converse of the add and increment operations. Listing 2 demonstrates these functions.

Listing 2. Простая арифметическая атомарная функция

                
 val = atomic_read( &my_counter );
 
 atomic_add( 1, &my_counter );
 
 atomic_inc( &my_counter );
 
 atomic_sub( 1, &my_counter );
 
 atomic_dec( &my_counter );

The API also supports a number of other common use cases, including the operate-and-test routines. These allow the atomic variable to be manipulated and then tested (all performed as one atomic operation). One special function called atomic_add_negative is used to add to the atomic variable and then return true if the resulting value is negative. This is used by some of the architecture-dependent semaphore functions in the kernel.

While many of the functions do not return the value of the variable, two in particular operate and return the resulting value (atomic_add_return and atomic_sub_return), as shown in Listing 3.

Listing 3. Принцип работы атомарных функций

                
 if (atomic_sub_and_test( 1, &my_counter )) {
   // my_counter is zero
 }
 
 if (atomic_dec_and_test( &my_counter )) {
   // my_counter is zero
 }
 
 if (atomic_inc_and_test( &my_counter )) {
   // my_counter is zero
 }
 
 if (atomic_add_negative( 1, &my_counter )) {
   // my_counter is less than zero
 }
 
 val = atomic_add_return( 1, &my_counter ));
 
 val = atomic_sub_return( 1, &my_counter ));

If your architecture supports 64-bit long types (BITS_PER_LONG is 64), then long_t atomic operations are available. You can see the available long operations in linux/include/asm-generic/atomic.h.

You'll also find support for bitmask operations with the atomic API. Rather than arithmetic operations (as explored earlier), you'll find set and clear operations. Many drivers use these atomic operations, particularly SCSI. The use of bitmask atomic operations is slightly different than arithmetic because only two operations are available (set mask and clear mask). You provide a value and the bitmask upon which the operation is to be performed, as shown in Listing 4.

Listing 4. Bitmask atomic functions

                
 unsigned long my_bitmask;
 
 atomic_clear_mask( 0, &my_bitmask );
 
 atomic_set_mask( (1<<24), &my_bitmask );

Прототипы атомарных API
Implementations for the atomic operations are architecture dependent and can therefore be found in ./linux/include/asm-<arch>/atomic.h.

Spinlocks

Spinlocks are a special way of ensuring mutual exclusion using a busy-wait lock. When the lock is available, it is taken, the mutually-exclusive action is performed, and then the lock is released. If the lock is not available, the thread busy-waits on the lock until it is available. While busy-waiting may seem inefficient, it can actually be faster than putting the thread to sleep and then waking it up later when the lock is available.

Spinlocks are really only useful in SMP systems, but because your code will end up running on an SMP system, adding them for UP systems is the right thing to do.

Spinlocks are available in two varieties: full locks and reader/writer locks. Let's look at the full lock variety first.

First you create a new spinlock through a simple declaration. This can be initialized in place or through a call to spin_lock_init. Each of the variants shown in Listing 5 accomplishes the same result.

Listing 5. Создание и инициализация спинлока

                
 spinlock_t my_spinlock = SPIN_LOCK_UNLOCKED;
 
 ... or ...
 
 DEFINE_SPINLOCK( my_spinlock );
 
 ... or ...
 
 spin_lock_init( &my_spinlock );

Now that you have a spinlock defined, there are a number of locking variants that you can use. Each is useful in different contexts.

First is the spin_lock and spin_unlock variant shown in Listing 6. This is the simplest and performs no interrupt disabling but includes full memory barriers. This variant assumes no interactions with interrupt handlers and this lock.

Listing 6. Спинлок : функции блокировки и разблокировки

                
                 spin_lock( &my_spinlock );
 
 // critical section
 
 spin_unlock( &my_spinlock );

Next is the irqsave and irqrestore pair shown in Listing 7. The spin_lock_irqsave function acquires the spinlock and disables interrupts on the local processor (in the SMP case). The spin_unlock_irqrestore function releases the spinlock and restores the interrupts (via the flags argument).

Listing 7. Спинлок для CPU interrupt disable

                
                 spin_lock_irqsave( &my_spinlock, flags );
 
 // critical section
 
 spin_unlock_irqrestore( &my_spinlock, flags );

A less safe variant of spin_lock_irqsave/spin_unlock_irqrestore is spin_lock_irq/spin_unlock_irq. I recommend that you avoid this variant because it makes assumptions about interrupt states.

Finally, if your kernel thread shares data with a bottom half, then you can use another variant of the spinlock. A bottom half is a way to defer work from interrupt handling to be done later in a device driver. This version of the spinlock disables soft interrupts on the local CPU. This has the effect of preventing softirqs, tasklets, and bottom halves from running on the local CPU. This variant is shown in Listing 8.

Listing 8. Спинлок-функции для низко-уровневого взаимодействия

                
                 spin_lock_bh( &my_spinlock );
 
 // critical section
 
 spin_unlock_bh( &my_spinlock );

In many cases, access to data is indicated by many readers and less writers (accessing the data for read is more common than accessing for write). To support this model, reader/writer locks were created. What's interesting with this model is that multiple readers are permitted access to the data at one time, but only one writer. If a writer has the lock, no reader is allowed to enter the critical section. If only a reader has the lock, then multiple readers are permitted in the critical section. Listing 9 demonstrates this model.

Listing 9. Спинлок-функции чтения-записи

                
 rwlock_t my_rwlock;
 
 rwlock_init( &my_rwlock );
 
 write_lock( &my_rwlock );
 
 // critical section -- can read and write
 
 write_unlock( &my_rwlock );
 
 
 read_lock( &my_rwlock );
 
 // critical section -- can read only
 
 read_unlock( &my_rwlock );

You'll also find variants of reader/writer spinlocks for bottom halves and interrupt request (IRQ) saving depending on the situation for which you require the lock. Obviously, if your use of the lock is reader/writer in nature, this spinlock should be used over the standard spinlock, which doesn't differentiate between readers and writers.

Mutexes are available in the kernel as a way to accomplish semaphore behavior. The kernel mutex is implemented on top of the atomic API, though this is not visible to the kernel user. The mutex is simple, but there are some rules you should remember. Only one task may hold the mutex at a time, and only this task can unlock the mutex. There is no recursive locking or unlocking of mutexes, and mutexes may not be used within interrupt context. But mutexes are faster and more compact than the current kernel semaphore option, so if they fit your need, they're the choice to use.

You create and initialize a mutex in one operation through the DEFINE_MUTEX macro. This creates a new mutex and initializes the structure. You can see this implementation in ./linux/include/linux/mutex.h.

                DEFINE_MUTEX( my_mutex );

The mutex API provides five functions: three are used for locking, one for unlocking, and another for testing a mutex. Let's first look at the locking functions. The first function, mutex_trylock, is used in situations where you want the lock immediately or you want control to be returned to you if the mutex is not available. This function is shown in Listing 10.

Listing 10. Мьютекс и mutex_trylock

                
 ret = mutex_trylock( &my_mutex );
 if (ret != 0) {
   // Got the lock!
 } else {
   // Did not get the lock
 }

If, instead, you want to wait for the lock, you can call mutex_lock. This call returns if the mutex is available; otherwise, it sleeps until the mutex is available. In either case, when control is returned, the caller holds the mutex. Finally, the mutex_lock_interruptible is used in situations where the caller may sleep. In this case, the function may return -EINTR. Both of these calls are shown in Listing 11.

Listing 11. Блокировка мьютекса

                
                 mutex_lock( &my_mutex );
 
 // Lock is now held by the caller.
 
 if (mutex_lock_interruptible( &my_mutex ) != 0)  {
 
   // Interrupted by a signal, no mutex held
 
 }

After a mutex is locked, it must be unlocked. This is accomplished with the mutex_unlock function. This function cannot be called from interrupt context. Finally, you can check the status of a mutex through a call to mutex_is_locked. This call actually compiles to an inline function. If the mutex is held (locked), then one is returned; otherwise, zero. Listing 12 demonstrates these functions.

Listing 12. Тестирование мьютекса и mutex_is_locked

                
                 mutex_unlock( &my_mutex );
 
 if (mutex_is_locked( &my_mutex ) == 0) {
 
   // Mutex is unlocked
 
 }

While the mutex API has its constraints because it's based upon the atomic API, it's efficient and, therefore, worthwhile if it fits your need.

Finally, there remains the big kernel lock (BKL). Its use in the kernel is diminishing, but the remaining uses are the most difficult to remove. The BKL made multiprocessor Linux possible, but finer-grained locks have slowly replaced the BKL. The BKL is provided through the functions lock_kernel and unlock_kernel. See ./linux/lib/kernel_lock.c for more information.

Управление памятью в LINUX
Linux впитал в себя различные механизмы реализации управления памяью , присущие юниксу,
и в то же время имеет свои уникальные фичи.

Виртуальная память в Linux
В Linux реализована 3-уровневая организация страниц памяти,
состоящая из 3-х типов страниц :
Page directory: каждый процесс имеет одну такую страницу .
Эта страница содержит указатели на страницы памяти 2-го уровня - page middle directory.
Она присутствует в основной памяти активных процессов.
Page middle directory: содержит указатели на страницы 3-го уровня - page table.
Page table: указывает на виртуальные страницы самого процесса

To use this three-level page table structure, a virtual address in Linux is viewed as consisting of four fields
The leftmost (most significant) field is used as an index into the page directory.
The next field serves as an index into the page middle directory.
The third field serves as an index into the page table.
The fourth field gives the offset within the selected page of memory.
The Linux page table structure is platform independent and was designed to accommodate the 64-bit Alpha processor,
which provides hardware support for three levels of paging.
With 64-bit addresses, the use of only two levels of pages on the Alpha would result in very large page tables
and directories. The 32-bit Pentium/x86 architecture has a two-level hardware paging mechanism.
The Linux software accommodates the two-level scheme by defining the size of the page middle directory as one.
Note that all references to an extra level of indirection are optimized away at compile time, not at run time.
Therefore, there is no performance overhead for using generic three-level design on platforms which support
only two levels in hardware.

Page Allocation
To enhance the efficiency of reading in and writing out pages to and from main memory,
Linux defines a mechanism for dealing with contiguous blocks of pages mapped into contiguous blocks of page frames.
For this purpose, the buddy system is used. The kernel maintains a list of contiguous page frame groups
of fixed size; a group may consist of 1, 2, 4, 8, 16, or 32 page frames.
As pages are allocated and deallocated in main memory, the available groups are split and merged using
the buddy algorithm.

Page Replacement Algorithm
The Linux page replacement algorithm is based on the clock algorithm described in Section 8.2 .
In the simple clock algorithm, a use bit and a modify bit are associated with each page in main memory.
In the Linux scheme, the use bit is replaced with an 8-bit age variable.
Each time that a page is accessed, the age variable is incremented. In the background,
Linux periodically sweeps through the global page pool and decrements the age variable for each page
as it rotates through all the pages in main memory. A page with an age of 0 is an "old" page
that has not been referenced in some time and is the best candidate for replacement.
The larger the value of age, the more frequently a page has been used in recent times and the less eligible
it is for replacement. Thus, the Linux algorithm is a form of least frequently used policy.

Kernel Memory Allocation
The Linux kernel memory capability manages physical main memory page frames.
Its primary function is to allocate and deallocate frames for particular uses.
Possible owners of a frame include user-space processes (i.e., the frame is part of the virtual memory
of a process that is currently resident in real memory), dynamically allocated kernel data,
static kernel code, and the page cache.2
The foundation of kernel memory allocation for Linux is the page allocation mechanism used
for user virtual memory management. As in the virtual memory scheme, a buddy algorithm is used so
that memory for the kernel can be allocated and deallocated in units of one or more pages.
Because the minimum amount of memory that can be allocated in this fashion is one page,
the page allocator alone would be inefficient because the kernel requires small short-term memory chunks
in odd sizes. To accommodate these small chunks, Linux uses a scheme known as slab allocation [BONW94]
within an allocated page. On a Pentium/x86 machine, the page size is 4 Kbytes, and chunks within a page
may be allocated of sizes 32, 64, 128, 252, 508, 2040, and 4080 bytes.
The slab allocator is relatively complex and is not examined in detail here;
a good description can be found in [VAHA96]. In essence, Linux maintains a set of linked lists,
one for each size of chunk. Chunks may be split and aggregated in a manner similar to the buddy algorithm,
and moved between lists accordingly.

LINUX SCHEDULING
For Linux 2.4 and earlier, Linux provided a real-time scheduling capability coupled with a scheduler
for non-real-time processes that made use of the traditional UNIX scheduling algorithm described in Section 9.3.
Linux 2.6 includes essentially the same real-time scheduling capability as previous releases
and a substantially revised scheduler for non-real-time processes. We examine these two areas in turn.

Real-Time Scheduling
The three Linux scheduling classes are:
SCHED_FIFO: First-in-first-out real-time threads
SCHED_RR: Round-robin real-time threads
SCHED_OTHER: Other, non-real-time threads
Within each class, multiple priorities may be used, with priorities in the real-time classes higher than
the priorities for the SCHED_OTHER class. The default values are as follows:
Realtime priority classes range from 0 to 99 inclusively, and SCHED_OTHER classes range from 100 to 139.
A lower number equals a higher priority. For FIFO threads, the following rules apply:
1. The system will not interrupt an executing FIFO thread except in the following cases:
a. Another FIFO thread of higher priority becomes ready.
b. The executing FIFO thread becomes blocked waiting for an event, such as I/O.
c. The executing FIFO thread voluntarily gives up the processor following a call to the primitive
sched_yield.
2. When an executing FIFO thread is interrupted, it is placed in the queue associated with its priority.
3. When a FIFO thread becomes ready and if that thread has a higher priority than the currently executing
thread, then the currently executing thread is preempted and the highest-priority ready
FIFO thread is executed. If more than one thread has that highest priority,
the thread that has been waiting the longest is chosen.
The SCHED_RR policy is similar to the SCHED_FIFO policy, except for the addition of a timeslice associated
with each thread. When a SCHED_RR thread has executed for its timeslice, it is suspended and a real-time thread
of equal or higher priority is selected for running. Figure 10.10 is an example that illustrates
the distinction between FIFO and RR scheduling.
Assume a process has four threads with three relative priorities assigned as shown in Figure 10.10a.
Assume that all waiting threads are ready to execute when the current thread waits or terminates and that
no higher-priority thread is awakened while a thread is executing.
Figure 10.10b shows a flow in which all of the threads are in the SCHED_FIFO class.
Thread D executes until it waits or terminates.
Next, although threads B and C have the same priority, thread B starts because it has been waiting longer
than thread C. Thread B executes until it waits or terminates, then thread C executes
until it waits or terminates. Finally, thread A executes.
Figure 10.10c shows a sample flow if all of the threads are in the SCHED_RR class.
Thread D executes until it waits or terminates.
Next, threads B and C are time sliced, because they both have the same priority.
Finally, thread A executes. The final scheduling class is SCHED_OTHER.
A thread in this class can only execute if there are no real-time threads ready to execute.

Non-Real-Time Scheduling
The Linux 2.4 scheduler for the SCHED_OTHER class did not scale well with increasing number of processors
and increasing number of processes. To address this problem, Linux 2.6 uses a completely new scheduler
known as the O(1) scheduler.3 The scheduler is designed so that the time to select the appropriate process
and assign it to a processor is constant, regardless of the load on the system or the number of processors.
The kernel maintains two scheduling data structure for each processor in the system,
of the following form (Figure 10.11):
struct prio_array {
int nr_active; /* number of tasks in this array*/
unsigned bitmap[BITMAP_SIZE]; /* priority bitmap */
struct list_head queue[MAX_PRIO]; /* priority queues */
A separate queue is maintained for each priority level.
The total number of queues in the structure is MAX_PRIO, which has a default value of 140.
The structure also includes a bitmap array of sufficient size to provide one bit per priority level.
Thus, with 140 priority levels and 32bit words, BITMAP_SIZE has a value of 5.
This creates a bitmap of 160 bits, of which 20 bits are ignored.
The bitmap indicates which queues are not empty.
Finally, nr_active indicates the total number of tasks present on all queues.
Two structures are maintained: an active queues structure and an expired queues structure.
Initially, both bitmaps are set to all zeroes and all queues are empty. As a process becomes ready,
it is assigned to the appropriate priority queue in the active queues structure
and is assigned the appropriate timeslice. If a task is preempted before it completes its timeslice,
it is returned to an active queue. When a task completes its timeslice, it goes into the appropriate queue
in the expired queues structure and is assigned a new timeslice.
All scheduling is done from among tasks in the active queues structure.
When the active queues structure is empty, a simple pointer assignment results in a switch
of the active and expired queues, and scheduling continues.
Scheduling is simple and efficient. On a given processor, the scheduler picks the highestpriority
nonempty queue. If multiple tasks are in that queue, the tasks are scheduled in roundrobin fashion.
Linux also includes a mechanism for moving a tasks from the queue lists of one processor to that of another.
Periodically, the scheduler checks to see if there is a substantial imbalance among the number of tasks assigned
to each processor. To balance the load, the schedule can transfer some tasks.
The highest priority active tasks are selected for transfer, because it is more important to distribute
high-priority tasks fairly.

Calculating Priorities and Timeslices
Each non-real-time task is assigned an initial priority in the range of 100 to 139, with a default of 120.
This is the task's static priority and is specified by the user.
As the task executes, a dynamic priority is calculated as a function of the task's static priority
and its execution behavior. The Linux scheduler is designed to favor I/O-bound tasks over processor-bound tasks.
This preference tends to provide good interactive response.
The technique used by Linux to determine the dynamic priority is to keep a running tab on how much time
a process sleeps (waiting for an event) versus how much time the process runs.
In essence, a task that spends most of its time sleeping is given a higher priority.
Timeslices are assigned in the range of 10 ms to 200 ms. In general, higher-priority tasks are assigned larger
timeslices.

Relationship to Real-Time Tasks
Real-time tasks are handled in a different manner from non-real-time tasks in the priority queues.
The following considerations apply:
1. All real-time tasks have only a static priority; no dynamic priority changes are made.
2. SCHED_FIFO tasks do not have assigned timeslices. Such tasks are scheduled in FIFO discipline.
If a SHED_FIFO task is blocked, it returns to the same priority queue in the active queue list
when it becomes unblocked.
3. Although SCHED_RR tasks do have assigned timeslices, they also are never moved to the expired
queue list. When a SCHED_RR task exhaust its timeslice, it is returned to its priority queue
with the same timeslice value. Timeslice values are never changed.
The effect of these rules is that the switch between the active queue list and the expired queue list
only happens when there are no ready real-time tasks waiting to execute.

Оставьте свой комментарий !

Ваше имя:

Комментарий:

Оба поля являются обязательными

Автор	Комментарий к данной статье