Serg Iakovlev

or:

and:

LINUX

Language

Kernel

Package

Book

Test

Forum

iakovlev.org

Memory Management 1
by Tim Robinson

Introduction

Управление физической памятью - низкоуровневый компонент любой ОС. В ядре может быть несколько менеджеров памяти , и каждый может работать на своем уровне .
Предполагается , что ядро работает с админскими правами и в защищенном режиме . Для x86 это означает , что это уровень 0 (PL0) .

Specifications
Когда управление передается из загрузчика в ядро , вы получаете полный контроль над системой. Код , данные и стек разнесены при этом на разные сектора . Вы напечатали на экране “Hello world” - и что же дальше ?
Задача менеджера памяти проста - разложить всю доступную память по полочкам . Память мы разбиваем на страницы - “pages”, одна страница - 4KB .
Процессор x86 уникален в том смысле , что он оперирует с сегментами и страницами : сначала идет ссылка на сегмент , а потом на страницу внутри него .

*	Большинство других архитектур разбивает адресное пространство на страницы . А в x86-архитектуре существует принцип сегментации по признаку кода,данных и стека . И нет никакой возможности перенести это на другие архитектуры , не поддерживающие сегментацию.

Не используя paging, на x86 можно получить доступ лишь к 16MB физической памяти. В 16-битном пространстве имеется 24-битная шина, дающая доступ лишь к 16MB. Не имея сегментации , мы ограничены 64KB. С приходом 386 появилась возможность разбить память на 4KB pages. У этого метода есть свои недостатки - например , невозможность выделить память менее 4 КБ.

Менеджер памяти делает вашу ось независимой от количества реальной физической памяти . Начиная с пентиума , адресация становится 32-битной .

Organization

В хидер каждой выделяемой страницы мы положим определенную информацию .
Нам нужно иметь указатель на выделяемую страницу , мы должны читать из нее и писать в нее .

Вначале нужно подумать о том , как разбить адресное пространство на user и kernel regions . user regions будет состоять из кода и данных пользователя , kernel regions - из собственно ядра и драйверов . Ядро должно быть недоступным для пользователя . Ядро можно например положить в первый мегабайт памяти , пользовательское пространство - во второй . Нужно скорректировать адрес страницы на корректный физический адрес .

Если мы имеем 256MB памяти , нам нужно 8192 байт для управления 65,536 страницами . При выделении памяти под процесс , нам нужно знать ее размер , идентификатор процесса и пользователя .

Для каждой страницы будет существовать бит , который указывает на то , выделена ли память или нет . В сумме мы будем иметь массив таких бит для всего множества страниц

Возможен другой механизм управления страницами - стековый , при котором адреса свободных невыделенных страниц помещаются в стек .

Initialization
Для того чтобы определить количество доступной физической памяти , при загрузке надо вызвать прерывание BIOS 15h .

Далее , нам нужно будет выделить статический кусок памяти для управления всей остальной памятью , и поместить его в конец ядра .

Мы знаем стартовый и конечный адрес ядра - его можно взять из GNU ld script. Кусок памяти мы назовем bitmap и положим сразу после ядра .

Allocation
Первой функцией менеджера памяти будет выделение физической страницы . Страница будет промаркирована как выделенная и будет возвращен ее адрес .
Если свободные страницы заканчиваются , нужно свопировать на диск .

Mapping
This is where the paging architecture starts to get useful. In short (and the processor manual explains this better), paging allows you to control the address space from software. You can map pages anywhere you like; you can apply protection to pages; and you can allocate and map pages on demand via page faults (or even emulate memory access completely).

I’ll leave the MMU description to the hardware manuals, and for this, I’ll stick to the x86 (although other architectures are fairly similar). The 386 and above use a three-level translation scheme: the PDBR (Page Directory Base Register, called CR3) contains the physical address of a page directory. The page directory is a page in memory split into 1,024 32-bit words, or Page Directory Entries, each of which is the physical address of a page table. Each page table is, too, split into 1,024 Page Table Entries; each of these words is the physical address of a page in memory.

Note that, to span 4GB of 4096-byte pages, you don’t need 32 bits for each physical address. The PDBR, PDEs and PTEs only need 20 bits each, so the bottom 12 bits are used as flags. These flags can apply protection (read/write and user vs. supervisor) to pages and page tables, and they can mark individual pages and page tables as present or not present. There is also an ‘accessed’ flag, which the processor sets when a page or page table is accessed. By switching the PDBR you can use different page directories and page tables in different contexts, thereby causing separate applications to use different address spaces.

Paging is immensely powerful, and its true benefits aren’t fully realized until you start writing user applications to make use of the address space. Now we need to write functions in our memory manager to maintain the various page directories and page tables.

Remember that page directories and page tables are just normal pages in memory; the processor needs to know their physical addresses. There’s only one page directory per process (or per address space) and there can be up to 1,024 page tables per page directory. We can allocate these as normal from our physical allocator we wrote before. Remember that page directory and page table addresses need to be page-aligned: that is, the bottom 12 bits need to be zero (otherwise they would interfere with the read/write/protection/accessed bits).

Before we enable paging we need valid a PDBR and page directory, and a valid page table for wherever we’re executing. The current page table needs to identity map the current EIP: when we set the paging bit in CR0, the CPU will still be executing from the same address, so it had better have some valid instructions at wherever it’s executing. This is why it’s good to locate the kernel at whatever address it’s going to end up once paging is enabled.

So before we enable paging, we need to allocate a page directory and one page table. You could use the page allocator or you could reserve giant (4096-byte) global variables: either way it’s the same. If you’re using global variables for your page directory/page table, be sure that they’re page-aligned.

Zero all the entries you’re not using. This will clear the present bit so that accessing the addresses they cover will cause a page fault. You could set the other bits to some distinctive pattern if you want your page faults to look good. Remember that each PDE (and page table) covers 4MB of your address space, which should be enough for your kernel – if not, I’d like to see it. Each PTE covers only 4KB, so you’ll probably need several PTEs to cover your kernel (again, if not, I’d like to see it…). Assuming you’re not relocating your kernel dynamically (and why? Since it’s the first user of the address space, it’s not likely to clash), you can just poke numbers into the page directory and page table you’ve allocated as appropriate. You’ll be making your memory-mapping function more sophisticated later.

All of this needs to go into your memory manager initialization routine, which I covered before. Since your memory manager should separate all the machine-specific memory stuff from the rest of the kernel, it’s polite to enable paging before your return, and it’s polite to keep the address format the same (if not, then the return address on the stack from the last function call will be wrong). So set up your page directory and page tables, enable paging, do a far jump (to reload CS) and reload the other selectors (SS, DS, ES, FS and GS) as appropriate. Now that you’ve enabled paging it makes sense to use segments whose base addresses are zero, although there’s no reason why this should be mandatory.

With luck, we’re now running with paging enabled. All addresses now are going through the page directory and page tables we established before. Note that the rest of physical memory is not necessarily mapped into our new address space – all that is necessary is the part of code that the CPU is currently executing (in practice, you’d map the whole kernel: data, stack and all). We’d like to map in video memory so we can print some more messages, and the full memory-mapping routine comes in useful.

Memory mapping
At first glance this appears pretty easy. Just poke words into the right pages in the current page directory and page table; allocate a new page table if required. However, the CPU uses physical addresses for page directories and page tables; we’re using virtual addresses for everything. There’s a number of ways around this:
1) Map all physical memory into the address space. This can either be done 1:1 (that is, physical memory is addressed by the bottom of the address space) or at some offset (that is, physical memory is accessible starting at, say, 0xD0000000). This approach’s advantage is its simplicity (Win9x uses this); however, its disadvantage is the fact that the user may have any amount of memory installed in their system, all of which must be addressable. Imagine if the user had 4GB installed: there would be no address space left…
2) Map each page into the address space and keep track of their virtual addresses in a virtual page directory parallel to the real one. The virtual page directory can store the virtual addresses of each of the page tables while the real page directory stores their physical addresses. This is good if the only pieces of physical memory which must be addressed directly are the page directories/page tables; however, it increases the amount of space taken up just by mapping – not good in a small system.
3) Map the page directory into itself. This might seem like a kind of weird fractal memory mapper, but it works well in practice. By setting one of the fixed PDEs to the physical address of the associated page directory, you can address PDEs and PTEs as separate addresses. If you set element 1023 of each page directory to the physical address of the page directory itself, the processor will see the page directory as the last page table. It will see the PDEs as PTEs, and it will see the PTEs as individual 32-bit words in the top 4MB of the address space. You can use the top 4KB of the address space as the entries in the original page directory. This has the advantage of being beautiful yet simple; it has the disadvantage that you can only access page mappings inside the current address space.
By way of an example, Windows NT maps up to 512MB of physical memory into the kernel’s address space (as in option 1) while mapping the page tables directly into the address space (as in option 3). Personally, I’d go for the third option, even though it takes some thinking about to get your head round it. The first option also has its advantages for a simple kernel. Either way, the page mapper’s job is simple. Remember to apply the correct protection at each stage: on the PTE, apply the desired protection for that page; on the PDE, apply the desired protection for that 4MB region. Each PDE should normally be made read/write and user-visible unless you have a good reason for making the whole 4MB region inaccessible from user mode.

Overview
That’s about it for our simple physical memory manager. If you want HAL-style machine abstraction you’ll probably need an address space switching function callable from the scheduler (on the x86, this is just MOV CR3, addr). If you’re going for a full VMS- or NT-style asynchronous device architecture you’ll need a routine which can lock a buffer in the virtual address space and record the physical pages associated with it; however, that comes under ‘advanced’ design and you probably won’t need it for a simple Minix or Linux model.

Before we can use our address fully, we’ll need a more sophisticated memory manager, so here’s a look forward to part 2 of the memory management tutorial.

Memory Management 2
by Tim Robinson

Introduction
You might have already read my “Memory Management 1” tutorial; if not, then you should go and read it now, because this tutorial relies on it. Here I’ll be describing the next level of memory management: turning the memory manager I wrote about in the first tutorial into something useful and general-purpose.

Remember that the low-level memory manager deals with the computer’s physical memory: that is, the raw address space of your PC before things like paging come into effect. The low-level manager has three main components: an allocator (for allocating one physical page at a time), a deallocator (for deallocating those pages) and a memory mapper (for modifying mappings between the virtual address space and physical memory). Now it’s time to make something useful out of these.

There are a few problems with the physical memory functions which make them unsuitable for everyday memory allocation like, say, malloc():

*	They only allocate whole pages at a time. This is fine if you only ever want to allocate memory in multiples of 4KB (on the x86), but not much use if you’re using things like linked lists (which you will be).

*	They deal with physical addresses, not virtual. This means that they don’t naturally take advantage of a paged architecture, which would allow your kernel to arrange applications’ address spaces independently of the layout of the computers RAM chips.

*	They don’t do much with the memory they allocate: they don't allow things like sharing, swapping to disk, memory-mapped files, integration with the disk cache, etc.

What we need to do is write a layer of functions on top of the physical memory manager which deal with these problems. Note that this memory manager, like most of the kernel, can be made portable: the things it deals with are generally the same across different architectures (that is, until you start dealing with things like the Z80 and the 80286).

Specifications
What we’re going to do is, effectively, write two memory managers: one which can manage the virtual address space for user applications, and one which implements malloc() and free() (or the equivalents in your language of choice) for your kernel. A virtual memory manager is useful for providing a large-block memory interface to your user applications: sure, it would be OK to expose the physical memory manager directly to user mode, but that would mean that user apps needed to know about the physical architecture of the target machine, and surely we’re writing a kernel to avoid that kind of thing?

It’s good to write versions of malloc() and free() for the kernel because it makes writing kernel (and driver) code easier and more enjoyable. I’m a great believer in providing a good run-time library interface in kernel mode, even if it means that the kernel grows in size. Different kernels provide these RTL facilities to different extents; the Linux kernel provides a small subset of the C RTL, and Windows NT provides its own set of functions which mirror those provided by the C standard. Linux calls its malloc() kmalloc(); Windows NT has ExAllocatePool() and friends; I’ll be sticking with the familiar malloc().

malloc() and free()
I’ll start with these two because they are probably the simplest to implement. The goal of any malloc() is to take a large slab of memory and divide it up into pieces as requested, and that is what we’ll be doing in the kernel. Here, we need to set aside a portion of the kernel’s address space (remember, the kernel’s address space ought to be separate from the user address space, in the interests of memory protection) from where malloc()’s memory will come.

I like to split my kernel’s address space into several regions. The MпїЅius kernel, for example, starts at address 0xC0000000; that is, it occupies the top gigabyte. Kernel memory is split up into 256MB-regions as follows:

0xC0000000
Kernel code, data, bss, etc.
Probably too much but, hey, we might see a 256MB kernel one day.
0xD0000000
Kernel heap
0xE0000000
Space reserved for device drivers
0xF0000000
Some physical memory (useful for video memory access)
Page directory and page tables of the current process

Note that 256MB of address space, from 0xD0000000 to 0xE0000000 is devoted to the kernel’s heap. This doesn’t mean that 256MB of physical memory will be given over to the kernel heap at all times; it just means that there is a 256MB window through which heap memory is accessed.

I’m not going to go into telling you how to write the malloc() and free() functions themselves; other people can do that better than me. There is a sample implementation in the book The C Programming Language by Kernighan and Ritchie, and there are several slot-in implementations on the Internet. However, one thing all malloc() implementations should have in common is a hook into the OS kernel to request more memory. In the more conventional user-mode environment, this function (often called morecore() or sbrk()) will ask the kernel for a large block of memory in order to enlarge the process’s address space. Here, we are the kernel, so we need to write our own morecore().

Incidentally, there is a slight difference between morecore() (as featured in K&R) and sbrk() (as used by most Unix implementations):

void *morecore(size_t n)
Requests a block of size n bytes (K&R uses multiples of the allocation header size) which might come from anywhere in the application’s address space
char *sbrk(size_t d)
Adjusts the size of the application’s address space by d bytes and returns a pointer to the start of the new region. Returns char* because the earliest versions of C compilers had no void* type.

The difference is subtle but it makes sense to be clear about these definitions. I’ll be using the morecore() interface here, but it should be trivial to adapt it for sbrk(); in fact, there’s no reason why morecore() can’t work the same way as sbrk() internally.

Remember: malloc() calls morecore() when it runs out of memory. If it succeeds, the new block is added to the free block list (if it fails, malloc() returns NULL). In the kernel we need to increase the amount of space set aside for the kernel’s heap by n bytes and return a pointer to the start of the block. But morecore() wants to return an address inside the virtual address space, and so far we only know how to allocate physical pages. Well, the obvious thing to do would be to:

1) Allocate a page using the physical memory allocator
2) Map it at the end of the kernel’s heap using the physical memory mapper
3) Adjust the kernel’s end-of-heap pointer
4) Repeat from (1) until enough memory has been allocated to satisfy the request

Note that there is no provision made to free the physical memory associated with the heap. I don’t know of any malloc() implementation which will ask the OS to free the memory is has been given (the RTL usually assumes that the program’s memory will be freed when it exits) but it should be possible to walk the free block heap periodically and release any memory used. As is is, this algorithm will only allocate memory as it is used: if a lot of memory is suddenly allocated and then freed, it will remain allocated (as far as the kernel is concerned) but it will be re-used next time malloc() is called.

That’s all there is to the kernel side of malloc(). Remember that this malloc() is only for use by the kernel and by kernel-mode device drivers. User applications can have their own funky implementation of malloc(), or new, or GetMem() as required. These user-mode allocators can call the kernel’s virtual memory manager, which I’ll conveniently cover next.

The Virtual Memory Manager
The kernel’s malloc() is all well and good; we can go off and write strdup() and new and make linked list functions now. But it’s not very sophisticated. We can make a much more powerful allocator which can do more than allocate and free arbitrary small blocks of memory: it can do all the things I talked about in the Introduction. This is where we write the virtual memory manager.

As with the physical memory manager from part 1, the virtual memory manager will be managing the address space. However, this address space is the virtual one: it is a constant 4GB long (on a 32-bit machine) and it needn’t correspond directly with the underlying memory installed the machine. Parts of it can be mapped to the same physical memory (to share memory); parts of it can be mapped nowhere at all (to provide guard pages, for example, at the end of stacks); parts of it can be unmapped but can emulate real memory (like in Windows NT’s ntvdm.exe, which emulates the PC’s BIOS Data Area for real-mode apps).

The key to all of this is the page fault. If you’ve done much Windows programming you’ll probably be familiar with the page fault (along with the access violation) as one of the symptoms of a spurious pointer; Linux programmers (and users) will know it as a segmentation violation or bus fault. However, as applications programmers, we only see the page fault when things go wrong. In reality, the page fault is the processor’s way of telling the kernel that it doesn’t know how to handle an access to a particular region of memory, and that some software handler needs to be invoked to sort it out. Under the correct circumstances, the kernel will be able to do so and continue the execution of the application.

The most benign kind of page fault occurs when the processor tries to read from or write to a page which has been marked as ‘not present’. Remember that each entry in the page directory and each entry in each page table has a ‘Present’ bit. The processor checks this bit on every memory access (actually it will cache the PDEs and PTEs to speed up memory access). If it is set then the access can continue; if not, it writes the address in question to the register CR2 and invokes the page fault handler. This is installed the same way as a normal interrupt handler; see the interrupts tutorial for more details. I’ll assume that you have got your interrupt handler sufficiently advanced that control will be passed to a routine when an interrupt occurs.

So the page fault handler is executed when a page is accessed which is, as far as the MMU is concerned, not present. The page fault handler needs to:
1) Check what should be located at the faulting address and,
2) Make sure the address is accessible if it should be, or,
3) Take the appropriate action if the address is incorrect

The "appropriate action" in this case will probably involve terminating the application in question. Different operating systems take different approaches over this; Windows NT invokes the current set of Structured Exception Handlers, the default action of which is to terminate the application; Linux invokes the appropriate signal handler.

Allocator
In order to check what should be located at the faulting address we need to keep records of how the address space is organized. To start with, we can write a simple virtual memory allocator. Forget about page faults for a moment while I describe what our virtual memory allocator will do.

In short, we need to be able to make good use of the 4GB given to us by the hardware designers. This comes down to writing a routine which can allocate an arbitrary amount of virtual memory. Note that the physical memory behind it doesn’t need to be allocated all at once, and it doesn’t need to be contiguous. Of course the physical memory does eventually need to be there – the processor needs something to send over the address bus to the main memory – but it’s not essential to allocate it all at once. For now it’s enough to record all allocations that take place in a list. Our malloc() can come in useful here because you can allocate one record per allocation and store things like the virtual address of the allocation and how many pages were allocated. In fact, we need to record at least the allocation base address and the size so that we can allocate (“commit”) physical memory later. It’s usually more worthwhile dealing in pages here, rather than bytes, because it makes the kernel’s accounting easier. Remember that the user-mode RTL can implement a malloc()-style allocator itself which can divide large blocks of memory into smaller ones. As a reasonable minimum you ought to record for each allocation:

*	Virtual base address (i.e. the start of the block in the current application’s address space)

Size

*	Protection (e.g. read/write, user vs. kernel)

At this point we have a function which can ‘mentally’ allocate blocks of memory and record them somewhere in a list. If you’re implementing process-based protection in your kernel you ought to keep one list per process, to avoid confusion.

Once some application has a block allocated it can try to access it. However, there’s no physical memory associated with the block yet – the page table entries (and, in fact, the page directory entry) for that region are marked as ‘not present’. So the processor invokes a page fault, and calls our page fault handler.

The first thing the handler needs to do is check the allocation list to find the faulting address. The x86 processor records the actual address accessed, not the address of the start of the page, but this shouldn’t make any difference here. We need to walk the list to find a block which spans the faulting address.

Once the block is found, we need to commit it. This will involve allocating the relevant physical pages and mapping them into the address space, using the physical memory allocator from part 1. It’s up to you whether you commit an entire block at once or commit it page-by-page. Imagine an allocation of 4MB: that’s 1,024 pages. On the one hand, the application might write one byte at the start and then forget about it. If we committed the whole block on the first access, we’d waste 3.99MB (that is, until the application freed the block). Alternatively we could commit one page at once, and invoke the page fault handler when each page was accessed. If the application allocated the 4MB block and wrote to every byte we’d end up faulting 1,024 times: each fault has its overhead which would be unnecessary if the whole block was committed on the first fault. As with so many design decisions, it’s up to you.

I mentioned the processor’s page table cache before. This is called the Translation Lookaside Buffer and it is an internal copy of the PDEs and PTEs as they were when the cache was last invalidated. The cache is invalidated when CR3 is written to and when the INVLPG instruction is executed (on the 486 and above). Invalidating the whole TLB takes a while so it should only be done when necessary (e.g. when switching between two processes). The INVLPG instruction is useful because it instructs the processor to invalidate only the region of the TLB associated with one page. The TLB must be updated when any changes are made to the page directory or page tables (the processor doesn’t do it automatically) so it is necessary after allocating and mapping the physical memory here.

Assuming that the faulting address was correctly identified as part of an allocated block, and the relevant physical memory was allocated and mapped successfully, the kernel is free to return from the page fault handler and continue execution of the current application. The processor will retry the faulting instruction and, if the kernel has done its job, all will be well.

Executables
If we load an executable program into memory, we’ll have a record of the makeup of that executable – that is, what sections it contains, where they are located, and how big they are. They will be laid out in the application’s address space in some order, and they needn’t all be loaded at startup. This is similar to the virtual memory allocator I just described, except that we already have a record of the executable’s sections in the header. Why not map the file into memory as it appears on disk, headers and all. Then it becomes unnecessary to allocate the executable’s memory with the virtual allocator.

To do this we need to walk the list of executable image sections at the same time as we check the list of allocated blocks. This is a matter of starting at the base address of the executable (that is, the address of the start of the headers), parsing the headers and locating the section which spans the faulting address. It may be more efficient to cache the sections’ details in some other format, but most executable formats (including COFF, ELF and PE) keep it reasonable. Then we can write a “separate executable section committer” which will load an executable section from disk, allocate physical storage and map it. We can make some more optimizations here – for example, if code pages are read-only, it will be unnecessary to swap them to disk if they are already present in the on-disk image. Read-only code pages can also be shared among multiple instances, and data pages can be shared until they are written to; once they are written to, a copy is made. The principle of copying pages only when they are written to ("dirtied") is known as Copy-On-Write, and is commonly used.

The principle behind COW is that the originals of such pages are made read-only. The processor will fault when the application attempts to write to it – because they are data pages there’s no reason why the application should avoid writing to them – and the error code on the stack will reflect this. The page fault handler detects that the faulting address is contained within a COW page, so it can make a copy of the underlying physical memory, re-map it in the address space, and allow the application to continue. Because the mapping has changed, the application will continue with the copy, which the page fault handler has now made read-write.

Copy-on-write is also used in the common Unix fork() paradigm. The fork() system call, which has no parallel in standard Win32, creates a copy of the current process which is identical – it uses the same code pages, data pages and stack pages are shared, and execution continues at the same location. To save memory, as many pages are shared as possible. However, the new process and the old one are completely separate, so each is free to write to its shared copy of the data. Here COW is used to spawn new copies of the shared data when either process (parent or child) modifies them.

Swapping
Now that we’re able to allocate memory on demand, we can do some more advanced things with our virtual memory manager. Granted, we can only implement the more advanced features when we’ve got things like a disk driver and a process manager going, but the theory’s here. One useful feature is the ability of the virtual memory manager to swap pages to and from disk. This is present in most modern operating systems and it is responsible for the thrashing noise when the hard disk is accessed in a low-memory situation. The theory is simple: sooner or later, a block will need to be committed, and the physical memory allocator will run out of pages. We need to clear out some space for the new block: after all, the new block is being accessed right now, and there’s probably blocks in the system which haven’t been accessed for a while.

The swapper function needs to:
1) Find a block suitable for swapping out to disk. This will usually be the least recently used one, although Win9x will swap arbitrary blocks out randomly.
2) Make room for it in the swap file: talk to the file system and disk drivers for this. A fixed contiguous swap file is simplest (пїЅla Windows 3.1): at startup, record the disk sector numbers associated with the swap file and talk directly to the disk driver when necessary. Alternatively you could access the swap file as a normal file, via the file system driver.
3) Write the old block to disk, and record that is has been swapped out
4) Repeat from (1) until the physical allocation request succeeds

Note that if any of this goes wrong, the commit will fail. Remember that if a page fault occurs in a block which has been swapped out, it must be swapped back in again.

Overview
A typical sequence of events after an illegal address is accessed might go as follows:
1.Processor triggers page fault.
2.Control is passed to the kernel, which calls the page fault handler.
3.The page fault handler looks through the list of allocated blocks for the current process.
3.1.A block has been found which spans the faulting address; the “commit” function is called.
3.2.The commit function allocates a physical page for each page in the block’s range.
3.3.If the block was in the swap file, each page of the block is read in from the swap file.
3.4.The relevant mappings are made in the current process’s address space, according to the protection flags given when the block was first allocated.
3.5.Control is passed back to the application.
4.The page fault handler looks through the list of executable modules for the current process.
4.1.An module has been found which spans the faulting address; the “dynamic load” function is called.
4.2.The dynamic load function looks for a section in the module which spans the faulting address.
4.2.It looks for a copy of that section already loaded into another process’s address space.
4.2a.1. Another copy of the section has been found. Its reference count is incremented.
4.2a.2. The physical memory for that section is mapped into the current address space read-only. A later write to the section will incur another page fault; then, the physical memory for the section is copied and unshared.
4.2b.1. The section has not been loaded anywhere else.
4.2b.2. Physical memory for the section is allocated and mapped into the current address space.
4.2b.3. The section is loaded from disk into memory.
4.3.Control is passed back to the application.
5.No valid block has been found; the current process is either terminated or control is passed to its exception/signal handlers.

I’ve gone through the major functionality of a reasonably-sophisticated virtual memory manager. The VMM is able to split up applications’ address spaces into reasonable chunks, and the kernel malloc() makes writing the rest of the kernel a lot easier. But so far all we’ve got is still a bloated “Hello, world” program, albeit one which uses protected mode and which can allocate a full 4GB of address space. No operating systems ever sold themselves on the ability to switch from one archaic processor mode to another which became public in 1986, so it’s time to give the kernel some work to do and start scheduling some code.

Page Tables
by Mike Rieker

Page Tables
This discussion makes reference to Intel hardware, but is generally applicable to other processors that implement pagetables.

When an usermode application runs, it addesses memory using virtual addresses. So the value in a typical C pointer is a virtual address. The application has no idea where in physical memory it actually is, and for the vast majority of cases, it really doesn't need to.

This has been necessary so that you can, for example, have two users doing a 'copy' command at the same time. Each 'copy' command has its own set of physical memory pages that it is running in, but each 'copy' command is running at the same virtual address as the others. This gives us these advantages:
1) It can be linked to run at the one virtual address and not have to be adjusted for a different address depending on where in physical memory it happens to load.
2) It only has access to it's virtual address space and can't interfere with other 'copy' commands, or any other command for that matter.

So now the CPU is running that 'copy' command. And let's say the copy command is written in C, and a routine in there has a pointer variable. I mentioned that all addresses, such as those in a pointer are virtual addresses. Somehow the CPU has to translate all virtual addresses to an actual physical memory addresses so it can read or write the memory. It performs a translation of a virtual address to the corresponding physical address by way of the page table.

A pagetable is basically an array of pagetable entries that is indexed by a virtual address. Each entry of the pagetable contains items such as:
1) Whether or not the page is accessible at all
2) If so, the corresponding physical address
3) Allowable access modes (user, kernel, read, write)

Now I said the pagetable is an array that is indexed by a virtual address. Let's say our virtual addresses are 32 bits (as in an Intel pentium type chip). It would be impractical to have a pagetable that had 2^32 entries. So instead of using all 32 bits of the virtual address as an index, only the most significat 20 bits are used, the lower 12 bits are ignored. So that means there is one pagetable entry for 4K worth of virtual addresses. For example, virtual addresses 00003012 and 000039EC both reference the same pagetable entry, specifically index 3. It is said then that Pentiums have a page size of 4K, that is, one pagetable entry (PTE) maps 4K of virtual addresses onto 4K physical addresses. The low 12 bits of the virtual address that were discarded earlier are now used to index the specific byte in the 4K physical memory page.

Most programs do not need all 2^32 bits of possible virtual address space to execute. They would have to be 4 gigabytes in size to do that! Usually they are at most a few megabytes. This implies then that most of the pagetable contains entries indicate nothing is supposed to be at a given virtual address. If we needed to supply a pagetable entry (PTE) for each 4K page in that 4G address space, that is 1Meg entries. Each PTE takes a long, so you need 4Meg just for the pagetable. Well, the designers of the Pentium realized that would be silly, as back then, that's all the memory there was in the whole computer. So there is a pagetable page for the pagetable pages themselves, called a pagetable directory page. This means you only have to have a page of pagetable entries for addresses you are actually using. So if a page of pagetable entries is all unused, you don't have to supply a pagetable page. Each entry of the pagetable directory page has a flag saying whether or not there is a corresponding pageta le page, and if so, the physical address of the corresponding pagetable page.

The virtual address now breaks down into:
Bits 31..22: index into the pagetable directory page
Bits 21..12: index into the pagetable page pointed to by the directory entry selected by bits 31..22
Bits 11..00: offset in the physical page pointed to by the pagetable entry selected by bits 21..12

The above is Intel specific, but other processors have similar stuff. Like Alphas have 8K pages, and IIRC, they have 3 levels of pagetable pages, instead of just 2 (because they have 64-bit virtual addresses). VAX's do it a completely different way, but the result is the same.

When we say that a particular process is active on a CPU, we are saying that that process' pagetables are being accessed by that CPU. In the Pentiums, this is done by loading the physical address of the pagetable directory in %CR3. So to switch processes on a Pentium, all one has to do is change the contents of %CR3 and you are now executing with different pagetables.

One more thing. The Intel hardware requires only that the pagetable directory and the pagetable pages exist in physical memory. It does not require that the pagetable stuff actually be mapped to virtual memory, it will run fine without that. However, it will make your coding miserable (and inefficient). So you *should* map the pagetables themselves to virtual addresses. In my OS, I put them at the high end. So the pagetable directory page for a given process is at that process' virtual address FFFFExxx. (I map FFFFFxxx to no-access to catch bad pointers at the high end). The pagetable pages are just below that and occupy virtual addresses FFCFE000 to FFFFDFFF (4 Meg). Just work out how to do it with pencil and paper, and in a few short hours of pure joy you'll have it.

Pagefaults
Now we have to fill in our pagetable with something useful. Let's say someone types in a 'copy' command or something like that. What happens (in regards to pagetables)?
1) The OS creates a pagetable directory page for the new process
2) The OS creates pagetable pages sufficient to point to the memory pages for the copy executable
3) The OS allocates memory pages for the copy executable
4) The OS reads the copy executable into those pages
5) Additional pagetable entries and memory pages are added for stack and heap memory

Now OS's usually don't do this for the whole image at once, they typically just set up structures to facilitate this. The reason for this is you may have a large image (such as MS Word), but you are only going to use a small part of it this time. So why load in all 10Meg when you are only going to be using say 1.5Meg? And if we only had 8Meg of ram available, it wouldn't even fit in there all at the same time!

So what an OS does is:
1) Create the pagetable directory page, with all entries marked as 'not-present', ie, there are no associated pagetable pages yet
2) Create a kernel struct that says where on disk and how big the executable is, and what virtual address it is to be loaded at
3) Create a kernel struct that says how big the stack is and what virtual address it starts at. Same for heap memory.

Now when a CPU tries to execute instructions or data in that image, the CPU will not be able to access it, because the pagetable directory contains a null entry indicating the instructions or data are not in memory. So what the CPU does is generate a Page Fault exception. This passes the virtual address of the attempted access, and whether it was a read or a write, and whether it was by kernel mode or user mode, to a Page Fault Exception handler, which (usually) is a kernel mode routine that you are going to supply. A prototype for the handler might look something like:

int pagefault_exception_handler (void *virtual_address,
int writing,
int usermode)

Input:
virtual_address = address that the program was trying to access but couldn't because either:
1) the directory indicates there is no pagetable page present
2) the pagetable entry indicates no physical page is allocated
3) the pagetable entry indicates that the access attempt is illegal (eg, writing to a page that is marked read-only)

writing = 0: the program is attempting to read the memory location
1: the program is attempting to write to the memory location

usermode = 0: the routine was executing in kernel mode
1: the routine was executing in user mode

Output:
pagefault_exception_handler = SUCCESS : the page was faulted in and the instruction should be retried
else : unable to fix it, signal a pagefault exception

So the pagefault handler routine looks at the kernel structs that were created to determine whether the access was legitimate, ie, is there supposed to be something there or not. If not, typically, it signals the exception (like a 'segment violation' in Linux). If there is supposed to be something there, that's when it allocates a pagetable page (if there isn't one there already), and allocates the page for the instructions or data.

Now it needs to look at the struct to see what goes into the page. If the struct says it is part of the executable, then the page is read in from disk. If it is a stack or heap page, it is filled with zeroes (or some other fill pattern).

Once the page is all set up, the pagefault handler returns back to the instruction that caused the fault. The CPU re-executes the instruction, and this time, it succeeds, because the pagetable entry is all filled in.

Some details
So the things the hardware requires us to tell it in a pagetable entry are:

*	whether or not the page is in memory

*	if so, what physical page it is in

*	what kind of access is currently allowed (read, write, user, kernel)

Now, it turns out that your OS may find it handy to save some other info in there, too. Fortunately, the hardware usually leaves unused bits in the pagetable entries just for this very purpose. If it doesn't, you'll have to make an array that parallels the pagetable array to store these extra bits.

So here are additional things I store in the pagetable entries:

*	software page state (valid and readable, valid and readable but dirty, valid and writable and dirty, page being read in from disk, page being written out to disk, page read/write from/to disk failed, page faulted out)

*	what the requested protection is (read, write, user, kernel)

Differs from current protection. Current protection should be 'no-access' when the page is faulted out, read-only when the page is being written out to disk, etc. Requested protection is what the page should be when it is fully realized.

Here are some low-level routines you will probably need for accessing pagetable entries:

read_pte : given a virtual address, determine if the corresponding pagetable entry exists in memory for the current process.
If not, return the virtual address of the missing pagetable entry (this will allow the caller to fault in the pagetable page then retry)
If it is in memory, return:
whether or not there is a memory page associated with it
if so, what physical page is there
current access modes
software page state
requested access modes

Now in my OS, my kernel's pagetable pages are always in memory, so I provide another routine, read_ptesys, that panics if the entry is not in memory, so I don't have to check the return value.

write_pte: given a virtual address, write the corresponding pagetable entry
For my OS, this routine could assume the pagetable page was in memory, as I always attempted to read the pagetable entry before writing it.

Translation Buffers
You may wonder if the CPU actually goes out and reads the pagetable entry for every memory access so it can translate the virtual address to the physical address. Well, they don't. CPU's typically have a special cache where they save the most recently used pagetable entries so they don't have to constantly go out and read pagetable entries.

These caches, typically, are somewhat 'manually operated'. The CPU will automatically load entries into the cache, but if you change a pagetable entry, it won't unload or update the corresponding cache entry. They do provide an instruction to unload it, though. So this just means your 'write_pte' subroutine has to also invoke this instruction after it sets up the new pagetable entry. For Intel chips, the magic instruction is INVLPG. The cache also gets completely flushed when you change which pagetable you are using, ie, you write %CR3.

Multiprocessor stuff
Now if you are running on an multiprocessor system, things can get a little messy. Let's say two (or more) CPU's are executing different threads in the same process. Suppose one of the CPU's does something like an 'munmap' call that removes pages from the virtual address space and frees the physical pages for use by other users. So it executes INVLPG instructions for itself to remove those bad cache entries. It also has to tell any of the other CPU's that are mapped to this pagetable to remove their bad cache entries, too! Typically, this is done (in Intel processors), by setting a value in memory indicating which virtual address to invalidate, then issuing an NMI to the other processors, and then waiting for them to acknowledge that they have done the invalidate.

Now there is a little trick you can use to minimize NMI-ing everytime you write an pte. Suppose, again, we have two CPU's executing different threads in the same process. And one of them gets a pagefault because part of the executable needs reading in. That pagetable entry will be 'no-access' because there is nothing there in memory. So the faulting CPU will read the page in from disk and set up the new pagetable entry, and issue an INVLPG to itself. Now it doesn't have to NMI the other CPU's. If the other CPU attempts to access that same page, the worst that will happen is it will get a pagefault too. Then your pagefault handler will see that some other CPU faulted the page in and it's ok, so it just does an INVLPG on itself and returns. So it turns out the only time you need to do the NMI thing is when you are 'downgrading' the protection on a page, ie, you are disallowing some type of access that was previously allowed. Examples are changing a page from read/write to read-only, or marking an entry as non-existant.

So my OS actually has two pte write routines, pte_writecur which invalidates only the calling CPU's cache entry, and pte_writeall which invalidates all CPU's cache entries. I must call pte_writeall for downgrades in protection, pte_writecur can be called for upgrades only.

Оставьте свой комментарий !

Ваше имя:

Комментарий:

Оба поля являются обязательными

Автор	Комментарий к данной статье