|Memory Management 1
by Tim Robinson
Управление физической памятью - низкоуровневый компонент любой ОС.
В ядре может быть несколько менеджеров памяти , и каждый может работать на своем уровне .
Предполагается , что ядро работает с админскими правами и в защищенном режиме .
Для x86 это означает , что это уровень 0 (PL0) .
Когда управление передается из загрузчика в ядро , вы получаете полный контроль над системой.
Код , данные и стек разнесены при этом на разные сектора .
Вы напечатали на экране “Hello world” - и что же дальше ?
Задача менеджера памяти проста - разложить всю доступную память по полочкам .
Память мы разбиваем на страницы - “pages”, одна страница - 4KB .
Процессор x86 уникален в том смысле , что он оперирует с сегментами и страницами :
сначала идет ссылка на сегмент , а потом на страницу внутри него .
Большинство других архитектур разбивает адресное пространство на страницы .
А в x86-архитектуре существует принцип сегментации по признаку кода,данных и стека .
И нет никакой возможности перенести это на другие архитектуры , не поддерживающие сегментацию.
Не используя paging, на x86 можно получить доступ лишь к 16MB физической памяти.
В 16-битном пространстве имеется 24-битная шина, дающая доступ
лишь к 16MB. Не имея сегментации , мы ограничены 64KB.
С приходом 386 появилась возможность разбить память на 4KB pages.
У этого метода есть свои недостатки - например , невозможность выделить память менее 4 КБ.
Менеджер памяти делает вашу ось независимой от количества реальной физической памяти .
Начиная с пентиума , адресация становится 32-битной .
В хидер каждой выделяемой страницы мы положим определенную информацию .
Нам нужно иметь указатель на выделяемую страницу , мы должны читать из нее
и писать в нее .
Вначале нужно подумать о том , как разбить адресное пространство
на user и kernel regions . user regions будет состоять из кода и данных пользователя ,
kernel regions - из собственно ядра и драйверов .
Ядро должно быть недоступным для пользователя .
Ядро можно например положить в первый мегабайт памяти , пользовательское пространство - во второй .
Нужно скорректировать адрес страницы на корректный физический адрес .
Если мы имеем 256MB памяти , нам нужно 8192 байт для управления
65,536 страницами . При выделении памяти под процесс , нам нужно знать ее размер ,
идентификатор процесса и пользователя .
Для каждой страницы будет существовать бит , который указывает на то ,
выделена ли память или нет . В сумме мы будем иметь массив таких бит для всего множества страниц
Возможен другой механизм управления страницами - стековый ,
при котором адреса свободных невыделенных страниц помещаются в стек .
Для того чтобы определить количество доступной физической памяти ,
при загрузке надо вызвать прерывание BIOS 15h .
Далее , нам нужно будет выделить статический кусок памяти для управления всей
остальной памятью , и поместить его в конец ядра .
Мы знаем стартовый и конечный адрес ядра - его можно взять из GNU ld script.
Кусок памяти мы назовем bitmap и положим сразу после ядра .
Первой функцией менеджера памяти будет выделение физической страницы .
Страница будет промаркирована как выделенная и будет возвращен ее адрес .
Если свободные страницы заканчиваются , нужно свопировать на диск .
This is where the paging architecture starts to
get useful. In short (and the processor manual explains this better),
paging allows you to control the address space from software. You can
map pages anywhere you like; you can apply protection to pages; and you
can allocate and map pages on demand via page faults (or even emulate
memory access completely).
I’ll leave the MMU description to the hardware manuals, and for
this, I’ll stick to the x86 (although other architectures are fairly
similar). The 386 and above use a three-level translation scheme: the
PDBR (Page Directory Base Register, called CR3) contains the physical
address of a page directory. The page directory is a page in memory
split into 1,024 32-bit words, or Page Directory Entries, each of which
is the physical address of a page table. Each page table is, too, split
into 1,024 Page Table Entries; each of these words is the physical
address of a page in memory.
Note that, to span 4GB of 4096-byte pages, you don’t need 32 bits
for each physical address. The PDBR, PDEs and PTEs only need 20 bits
each, so the bottom 12 bits are used as flags. These flags can apply
protection (read/write and user vs. supervisor) to pages and page
tables, and they can mark individual pages and page tables as present
or not present. There is also an ‘accessed’ flag, which the processor
sets when a page or page table is accessed. By switching the PDBR you
can use different page directories and page tables in different
contexts, thereby causing separate applications to use different
Paging is immensely powerful, and its true benefits aren’t fully
realized until you start writing user applications to make use of the
address space. Now we need to write functions in our memory manager to
maintain the various page directories and page tables.
Remember that page directories and page tables are just normal
pages in memory; the processor needs to know their physical addresses.
There’s only one page directory per process (or per address space) and
there can be up to 1,024 page tables per page directory. We can
allocate these as normal from our physical allocator we wrote before.
Remember that page directory and page table addresses need to be
page-aligned: that is, the bottom 12 bits need to be zero (otherwise
they would interfere with the read/write/protection/accessed bits).
Before we enable paging we need valid a PDBR and page directory,
and a valid page table for wherever we’re executing. The current page
table needs to identity map the current EIP: when we set the paging bit
in CR0, the CPU will still be executing from the same address, so it
had better have some valid instructions at wherever it’s executing.
This is why it’s good to locate the kernel at whatever address it’s
going to end up once paging is enabled.
So before we enable paging, we need to allocate a page directory
and one page table. You could use the page allocator or you could
reserve giant (4096-byte) global variables: either way it’s the same.
If you’re using global variables for your page directory/page table, be
sure that they’re page-aligned.
Zero all the entries you’re not using. This will clear the present
bit so that accessing the addresses they cover will cause a page fault.
You could set the other bits to some distinctive pattern if you want
your page faults to look good. Remember that each PDE (and page table)
covers 4MB of your address space, which should be enough for your
kernel – if not, I’d like to see it. Each PTE covers only 4KB, so
you’ll probably need several PTEs to cover your kernel (again, if not,
I’d like to see it…). Assuming you’re not relocating your kernel
dynamically (and why? Since it’s the first user of the address space,
it’s not likely to clash), you can just poke numbers into the page
directory and page table you’ve allocated as appropriate. You’ll be
making your memory-mapping function more sophisticated later.
All of this needs to go into your memory manager initialization
routine, which I covered before. Since your memory manager should
separate all the machine-specific memory stuff from the rest of the
kernel, it’s polite to enable paging before your return, and it’s
polite to keep the address format the same (if not, then the return
address on the stack from the last function call will be wrong). So set
up your page directory and page tables, enable paging, do a far jump
(to reload CS) and reload the other selectors (SS, DS, ES, FS and GS)
as appropriate. Now that you’ve enabled paging it makes sense to use
segments whose base addresses are zero, although there’s no reason why
this should be mandatory.
With luck, we’re now running with paging enabled. All addresses now
are going through the page directory and page tables we established
before. Note that the rest of physical memory is not necessarily mapped
into our new address space – all that is necessary is the part of code
that the CPU is currently executing (in practice, you’d map the whole
kernel: data, stack and all). We’d like to map in video memory so we
can print some more messages, and the full memory-mapping routine comes
At first glance this appears pretty easy. Just
poke words into the right pages in the current page directory and page
table; allocate a new page table if required. However, the CPU uses
physical addresses for page directories and page tables; we’re using
virtual addresses for everything. There’s a number of ways around this:
1) Map all physical memory into the address space. This can either
be done 1:1 (that is, physical memory is addressed by the bottom of the
address space) or at some offset (that is, physical memory is
accessible starting at, say, 0xD0000000). This approach’s advantage is
its simplicity (Win9x uses this); however, its disadvantage is the fact
that the user may have any amount of memory installed in their system,
all of which must be addressable. Imagine if the user had 4GB
installed: there would be no address space left…
2) Map each page into the address space and keep track of their virtual
addresses in a virtual page directory parallel to the real one. The
virtual page directory can store the virtual addresses of each of the
page tables while the real page directory stores their physical
addresses. This is good if the only pieces of physical memory which
must be addressed directly are the page directories/page tables;
however, it increases the amount of space taken up just by mapping –
not good in a small system.
3) Map the page directory into itself. This might seem like a kind of
weird fractal memory mapper, but it works well in practice. By setting
one of the fixed PDEs to the physical address of the associated page
directory, you can address PDEs and PTEs as separate addresses. If you
set element 1023 of each page directory to the physical address of the
page directory itself, the processor will see the page directory as the
last page table. It will see the PDEs as PTEs, and it will see the PTEs
as individual 32-bit words in the top 4MB of the address space. You can
use the top 4KB of the address space as the entries in the original
page directory. This has the advantage of being beautiful yet simple;
it has the disadvantage that you can only access page mappings inside
the current address space.
By way of an example, Windows NT maps up to 512MB of physical memory
into the kernel’s address space (as in option 1) while mapping the page
tables directly into the address space (as in option 3). Personally,
I’d go for the third option, even though it takes some thinking about
to get your head round it. The first option also has its advantages for
a simple kernel. Either way, the page mapper’s job is simple. Remember
to apply the correct protection at each stage: on the PTE, apply the
desired protection for that page; on the PDE, apply the desired
protection for that 4MB region. Each PDE should normally be made
read/write and user-visible unless you have a good reason for making
the whole 4MB region inaccessible from user mode.
That’s about it for our simple physical memory
manager. If you want HAL-style machine abstraction you’ll probably need
an address space switching function callable from the scheduler (on the
x86, this is just MOV CR3, addr). If you’re going for a full VMS- or
NT-style asynchronous device architecture you’ll need a routine which
can lock a buffer in the virtual address space and record the physical
pages associated with it; however, that comes under ‘advanced’ design
and you probably won’t need it for a simple Minix or Linux model.
Before we can use our address fully, we’ll need a more
sophisticated memory manager, so here’s a look forward to part 2 of the
memory management tutorial.
|Memory Management 2
by Tim Robinson
You might have already read my “Memory Management 1” tutorial; if not,
then you should go and read it now, because this tutorial relies on it.
Here I’ll be describing the next level of memory management: turning
the memory manager I wrote about in the first tutorial into something
useful and general-purpose.
Remember that the low-level memory manager deals with the
computer’s physical memory: that is, the raw address space of your PC
before things like paging come into effect. The low-level manager has
three main components: an allocator (for allocating one physical page
at a time), a deallocator (for deallocating those pages) and a memory
mapper (for modifying mappings between the virtual address space and
physical memory). Now it’s time to make something useful out of these.
There are a few problems with the physical memory functions which
make them unsuitable for everyday memory allocation like, say,
allocate whole pages at a time. This is fine if you only ever want to
allocate memory in multiples of 4KB (on the x86), but not much use if
you’re using things like linked lists (which you will be).
|They deal with
physical addresses, not virtual. This means that they don’t naturally
take advantage of a paged architecture, which would allow your kernel
to arrange applications’ address spaces independently of the layout of
the computers RAM chips.
|They don’t do
much with the memory they allocate: they don't allow things like
sharing, swapping to disk, memory-mapped files, integration with the
disk cache, etc.
What we need to do is write a layer of functions on top of the physical
memory manager which deal with these problems. Note that this memory
manager, like most of the kernel, can be made portable: the things it
deals with are generally the same across different architectures (that
is, until you start dealing with things like the Z80 and the 80286).
What we’re going to do is, effectively, write two
memory managers: one which can manage the virtual address space for
user applications, and one which implements malloc() and free() (or the
equivalents in your language of choice) for your kernel. A virtual
memory manager is useful for providing a large-block memory interface
to your user applications: sure, it would be OK to expose the physical
memory manager directly to user mode, but that would mean that user
apps needed to know about the physical architecture of the target
machine, and surely we’re writing a kernel to avoid that kind of thing?
It’s good to write versions of malloc() and free() for the kernel
because it makes writing kernel (and driver) code easier and more
enjoyable. I’m a great believer in providing a good run-time library
interface in kernel mode, even if it means that the kernel grows in
size. Different kernels provide these RTL facilities to different
extents; the Linux kernel provides a small subset of the C RTL, and
Windows NT provides its own set of functions which mirror those
provided by the C standard. Linux calls its malloc() kmalloc(); Windows
NT has ExAllocatePool() and friends; I’ll be sticking with the familiar
malloc() and free()
I’ll start with these two because they are
probably the simplest to implement. The goal of any malloc() is to take
a large slab of memory and divide it up into pieces as requested, and
that is what we’ll be doing in the kernel. Here, we need to set aside a
portion of the kernel’s address space (remember, the kernel’s address
space ought to be separate from the user address space, in the
interests of memory protection) from where malloc()’s memory will come.
I like to split my kernel’s address space into several regions. The
MпїЅius kernel, for example, starts at address 0xC0000000; that is, it
occupies the top gigabyte. Kernel memory is split up into 256MB-regions
Kernel code, data, bss, etc.
Probably too much but, hey, we might see a 256MB kernel one day.
Space reserved for device drivers
Some physical memory (useful for video memory access)
Page directory and page tables of the current process
Note that 256MB of address space, from 0xD0000000 to 0xE0000000 is
devoted to the kernel’s heap. This doesn’t mean that 256MB of physical
memory will be given over to the kernel heap at all times; it just
means that there is a 256MB window through which heap memory is
I’m not going to go into telling you how to write the malloc() and
free() functions themselves; other people can do that better than me.
There is a sample implementation in the book The C Programming Language
by Kernighan and Ritchie, and there are several slot-in implementations
on the Internet. However, one thing all malloc() implementations should
have in common is a hook into the OS kernel to request more memory. In
the more conventional user-mode environment, this function (often
called morecore() or sbrk()) will ask the kernel for a large block of
memory in order to enlarge the process’s address space. Here, we are
the kernel, so we need to write our own morecore().
Incidentally, there is a slight difference between morecore() (as
featured in K&R) and sbrk() (as used by most Unix implementations):
void *morecore(size_t n)
Requests a block of size n bytes
(K&R uses multiples of the allocation header size) which might come
from anywhere in the application’s address space
char *sbrk(size_t d)
Adjusts the size of the application’s address space by d bytes and
returns a pointer to the start of the new region. Returns char* because
the earliest versions of C compilers had no void* type.
The difference is subtle but it makes sense to be clear about these
definitions. I’ll be using the morecore() interface here, but it should
be trivial to adapt it for sbrk(); in fact, there’s no reason why
morecore() can’t work the same way as sbrk() internally.
Remember: malloc() calls morecore() when it runs out of memory. If
it succeeds, the new block is added to the free block list (if it
fails, malloc() returns NULL). In the kernel we need to increase the
amount of space set aside for the kernel’s heap by n bytes and return a
pointer to the start of the block. But morecore() wants to return an
address inside the virtual address space, and so far we only know how
to allocate physical pages. Well, the obvious thing to do would be to:
1) Allocate a page using the physical memory allocator
2) Map it at the end of the kernel’s heap using the physical memory mapper
3) Adjust the kernel’s end-of-heap pointer
4) Repeat from (1) until enough memory has been allocated to satisfy the request
Note that there is no provision made to free the physical memory
associated with the heap. I don’t know of any malloc() implementation
which will ask the OS to free the memory is has been given (the RTL
usually assumes that the program’s memory will be freed when it exits)
but it should be possible to walk the free block heap periodically and
release any memory used. As is is, this algorithm will only allocate
memory as it is used: if a lot of memory is suddenly allocated and then
freed, it will remain allocated (as far as the kernel is concerned) but
it will be re-used next time malloc() is called.
That’s all there is to the kernel side of malloc(). Remember that
this malloc() is only for use by the kernel and by kernel-mode device
drivers. User applications can have their own funky implementation of
malloc(), or new, or GetMem() as required. These user-mode allocators
can call the kernel’s virtual memory manager, which I’ll conveniently
The Virtual Memory Manager
The kernel’s malloc() is all well and good; we can
go off and write strdup() and new and make linked list functions now.
But it’s not very sophisticated. We can make a much more powerful
allocator which can do more than allocate and free arbitrary small
blocks of memory: it can do all the things I talked about in the
Introduction. This is where we write the virtual memory manager.
As with the physical memory manager from part 1, the virtual memory
manager will be managing the address space. However, this address space
is the virtual one: it is a constant 4GB long (on a 32-bit machine) and
it needn’t correspond directly with the underlying memory installed the
machine. Parts of it can be mapped to the same physical memory (to
share memory); parts of it can be mapped nowhere at all (to provide
guard pages, for example, at the end of stacks); parts of it can be
unmapped but can emulate real memory (like in Windows NT’s ntvdm.exe,
which emulates the PC’s BIOS Data Area for real-mode apps).
The key to all of this is the page fault. If you’ve done much
Windows programming you’ll probably be familiar with the page fault
(along with the access violation) as one of the symptoms of a spurious
pointer; Linux programmers (and users) will know it as a segmentation
violation or bus fault. However, as applications programmers, we only
see the page fault when things go wrong. In reality, the page fault is
the processor’s way of telling the kernel that it doesn’t know how to
handle an access to a particular region of memory, and that some
software handler needs to be invoked to sort it out. Under the correct
circumstances, the kernel will be able to do so and continue the
execution of the application.
The most benign kind of page fault occurs when the processor tries
to read from or write to a page which has been marked as ‘not present’.
Remember that each entry in the page directory and each entry in each
page table has a ‘Present’ bit. The processor checks this bit on every
memory access (actually it will cache the PDEs and PTEs to speed up
memory access). If it is set then the access can continue; if not, it
writes the address in question to the register CR2 and invokes the page
fault handler. This is installed the same way as a normal interrupt
handler; see the interrupts tutorial for more details. I’ll assume that
you have got your interrupt handler sufficiently advanced that control
will be passed to a routine when an interrupt occurs.
So the page fault handler is executed when a page is accessed which
is, as far as the MMU is concerned, not present. The page fault handler
1) Check what should be located at the faulting address and,
2) Make sure the address is accessible if it should be, or,
3) Take the appropriate action if the address is incorrect
The "appropriate action" in this case will probably involve
terminating the application in question. Different operating systems
take different approaches over this; Windows NT invokes the current set
of Structured Exception Handlers, the default action of which is to
terminate the application; Linux invokes the appropriate signal
In order to check what should be located at the
faulting address we need to keep records of how the address space is
organized. To start with, we can write a simple virtual memory
allocator. Forget about page faults for a moment while I describe what
our virtual memory allocator will do.
In short, we need to be able to make good use of the 4GB given to
us by the hardware designers. This comes down to writing a routine
which can allocate an arbitrary amount of virtual memory. Note that the
physical memory behind it doesn’t need to be allocated all at once, and
it doesn’t need to be contiguous. Of course the physical memory does
eventually need to be there – the processor needs something to send
over the address bus to the main memory – but it’s not essential to
allocate it all at once. For now it’s enough to record all allocations
that take place in a list. Our malloc() can come in useful here because
you can allocate one record per allocation and store things like the
virtual address of the allocation and how many pages were allocated. In
fact, we need to record at least the allocation base address and the
size so that we can allocate (“commit”) physical memory later. It’s
usually more worthwhile dealing in pages here, rather than bytes,
because it makes the kernel’s accounting easier. Remember that the
user-mode RTL can implement a malloc()-style allocator itself which can
divide large blocks of memory into smaller ones. As a reasonable
minimum you ought to record for each allocation:
|Virtual base address (i.e. the start of the block in the current application’s address space)
|Protection (e.g. read/write, user vs. kernel)
At this point we have a function which can ‘mentally’ allocate
blocks of memory and record them somewhere in a list. If you’re
implementing process-based protection in your kernel you ought to keep
one list per process, to avoid confusion.
Once some application has a block allocated it can try to access
it. However, there’s no physical memory associated with the block yet –
the page table entries (and, in fact, the page directory entry) for
that region are marked as ‘not present’. So the processor invokes a
page fault, and calls our page fault handler.
The first thing the handler needs to do is check the allocation
list to find the faulting address. The x86 processor records the actual
address accessed, not the address of the start of the page, but this
shouldn’t make any difference here. We need to walk the list to find a
block which spans the faulting address.
Once the block is found, we need to commit it. This will involve
allocating the relevant physical pages and mapping them into the
address space, using the physical memory allocator from part 1. It’s up
to you whether you commit an entire block at once or commit it
page-by-page. Imagine an allocation of 4MB: that’s 1,024 pages. On the
one hand, the application might write one byte at the start and then
forget about it. If we committed the whole block on the first access,
we’d waste 3.99MB (that is, until the application freed the block).
Alternatively we could commit one page at once, and invoke the page
fault handler when each page was accessed. If the application allocated
the 4MB block and wrote to every byte we’d end up faulting 1,024 times:
each fault has its overhead which would be unnecessary if the whole
block was committed on the first fault. As with so many design
decisions, it’s up to you.
I mentioned the processor’s page table cache before. This is called
the Translation Lookaside Buffer and it is an internal copy of the PDEs
and PTEs as they were when the cache was last invalidated. The cache is
invalidated when CR3 is written to and when the INVLPG instruction is
executed (on the 486 and above). Invalidating the whole TLB takes a
while so it should only be done when necessary (e.g. when switching
between two processes). The INVLPG instruction is useful because it
instructs the processor to invalidate only the region of the TLB
associated with one page. The TLB must be updated when any changes are
made to the page directory or page tables (the processor doesn’t do it
automatically) so it is necessary after allocating and mapping the
physical memory here.
Assuming that the faulting address was correctly identified as part
of an allocated block, and the relevant physical memory was allocated
and mapped successfully, the kernel is free to return from the page
fault handler and continue execution of the current application. The
processor will retry the faulting instruction and, if the kernel has
done its job, all will be well.
If we load an executable program into memory,
we’ll have a record of the makeup of that executable – that is, what
sections it contains, where they are located, and how big they are.
They will be laid out in the application’s address space in some order,
and they needn’t all be loaded at startup. This is similar to the
virtual memory allocator I just described, except that we already have
a record of the executable’s sections in the header. Why not map the
file into memory as it appears on disk, headers and all. Then it
becomes unnecessary to allocate the executable’s memory with the
To do this we need to walk the list of executable image sections at
the same time as we check the list of allocated blocks. This is a
matter of starting at the base address of the executable (that is, the
address of the start of the headers), parsing the headers and locating
the section which spans the faulting address. It may be more efficient
to cache the sections’ details in some other format, but most
executable formats (including COFF, ELF and PE) keep it reasonable.
Then we can write a “separate executable section committer” which will
load an executable section from disk, allocate physical storage and map
it. We can make some more optimizations here – for example, if code
pages are read-only, it will be unnecessary to swap them to disk if
they are already present in the on-disk image. Read-only code pages can
also be shared among multiple instances, and data pages can be shared
until they are written to; once they are written to, a copy is made.
The principle of copying pages only when they are written to
("dirtied") is known as Copy-On-Write, and is commonly used.
The principle behind COW is that the originals of such pages are
made read-only. The processor will fault when the application attempts
to write to it – because they are data pages there’s no reason why the
application should avoid writing to them – and the error code on the
stack will reflect this. The page fault handler detects that the
faulting address is contained within a COW page, so it can make a copy
of the underlying physical memory, re-map it in the address space, and
allow the application to continue. Because the mapping has changed, the
application will continue with the copy, which the page fault handler
has now made read-write.
Copy-on-write is also used in the common Unix fork() paradigm. The
fork() system call, which has no parallel in standard Win32, creates a
copy of the current process which is identical – it uses the same code
pages, data pages and stack pages are shared, and execution continues
at the same location. To save memory, as many pages are shared as
possible. However, the new process and the old one are completely
separate, so each is free to write to its shared copy of the data. Here
COW is used to spawn new copies of the shared data when either process
(parent or child) modifies them.
Now that we’re able to allocate memory on demand,
we can do some more advanced things with our virtual memory manager.
Granted, we can only implement the more advanced features when we’ve
got things like a disk driver and a process manager going, but the
theory’s here. One useful feature is the ability of the virtual memory
manager to swap pages to and from disk. This is present in most modern
operating systems and it is responsible for the thrashing noise when
the hard disk is accessed in a low-memory situation. The theory is
simple: sooner or later, a block will need to be committed, and the
physical memory allocator will run out of pages. We need to clear out
some space for the new block: after all, the new block is being
accessed right now, and there’s probably blocks in the system which
haven’t been accessed for a while.
The swapper function needs to:
1) Find a block suitable for swapping out to disk. This will
usually be the least recently used one, although Win9x will swap
arbitrary blocks out randomly.
2) Make room for it in the swap file: talk to the file system and disk
drivers for this. A fixed contiguous swap file is simplest (пїЅla
Windows 3.1): at startup, record the disk sector numbers associated
with the swap file and talk directly to the disk driver when necessary.
Alternatively you could access the swap file as a normal file, via the
file system driver.
3) Write the old block to disk, and record that is has been swapped out
4) Repeat from (1) until the physical allocation request succeeds
Note that if any of this goes wrong, the commit will fail. Remember
that if a page fault occurs in a block which has been swapped out, it
must be swapped back in again.
A typical sequence of events after an illegal address is accessed might go as follows:
1.Processor triggers page fault.
2.Control is passed to the kernel, which calls the page fault handler.
3.The page fault handler looks through the list of allocated blocks for the current process.
3.1.A block has been found which spans the faulting address; the “commit” function is called.
3.2.The commit function allocates a physical page for each page in the block’s range.
3.3.If the block was in the swap file, each page of the block is read in from the swap file.
3.4.The relevant mappings are made in the current process’s
address space, according to the protection flags given when the block
was first allocated.
3.5.Control is passed back to the application.
4.The page fault handler looks through the list of executable modules for the current process.
4.1.An module has been found which spans the faulting address; the “dynamic load” function is called.
4.2.The dynamic load function looks for a section in the module which spans the faulting address.
4.2.It looks for a copy of that section already loaded into another process’s address space.
4.2a.1. Another copy of the section has been found. Its reference count is incremented.
4.2a.2. The physical memory for that section is mapped into the
current address space read-only. A later write to the section will
incur another page fault; then, the physical memory for the section is
copied and unshared.
4.2b.1. The section has not been loaded anywhere else.
4.2b.2. Physical memory for the section is allocated and mapped into the current address space.
4.2b.3. The section is loaded from disk into memory.
4.3.Control is passed back to the application.
5.No valid block has been found; the current process is either
terminated or control is passed to its exception/signal handlers.
I’ve gone through the major functionality of a reasonably-sophisticated
virtual memory manager. The VMM is able to split up applications’
address spaces into reasonable chunks, and the kernel malloc() makes
writing the rest of the kernel a lot easier. But so far all we’ve got
is still a bloated “Hello, world” program, albeit one which uses
protected mode and which can allocate a full 4GB of address space. No
operating systems ever sold themselves on the ability to switch from
one archaic processor mode to another which became public in 1986, so
it’s time to give the kernel some work to do and start scheduling some
by Mike Rieker
This discussion makes reference to Intel hardware, but is generally applicable to other processors that implement pagetables.
When an usermode application runs, it addesses memory using virtual
addresses. So the value in a typical C pointer is a virtual address.
The application has no idea where in physical memory it actually is,
and for the vast majority of cases, it really doesn't need to.
This has been necessary so that you can, for example, have two
users doing a 'copy' command at the same time. Each 'copy' command has
its own set of physical memory pages that it is running in, but each
'copy' command is running at the same virtual address as the others.
This gives us these advantages:
1) It can be linked to run at the one virtual address and not have
to be adjusted for a different address depending on where in physical
memory it happens to load.
2) It only has access to it's virtual address space and can't
interfere with other 'copy' commands, or any other command for that
So now the CPU is running that 'copy' command. And let's say the
copy command is written in C, and a routine in there has a pointer
variable. I mentioned that all addresses, such as those in a pointer
are virtual addresses. Somehow the CPU has to translate all virtual
addresses to an actual physical memory addresses so it can read or
write the memory. It performs a translation of a virtual address to the
corresponding physical address by way of the page table.
A pagetable is basically an array of pagetable entries that is
indexed by a virtual address. Each entry of the pagetable contains
items such as:
1) Whether or not the page is accessible at all
2) If so, the corresponding physical address
3) Allowable access modes (user, kernel, read, write)
Now I said the pagetable is an array that is indexed by a virtual
address. Let's say our virtual addresses are 32 bits (as in an Intel
pentium type chip). It would be impractical to have a pagetable that
had 2^32 entries. So instead of using all 32 bits of the virtual
address as an index, only the most significat 20 bits are used, the
lower 12 bits are ignored. So that means there is one pagetable entry
for 4K worth of virtual addresses. For example, virtual addresses
00003012 and 000039EC both reference the same pagetable entry,
specifically index 3. It is said then that Pentiums have a page size of
4K, that is, one pagetable entry (PTE) maps 4K of virtual addresses
onto 4K physical addresses. The low 12 bits of the virtual address that
were discarded earlier are now used to index the specific byte in the
4K physical memory page.
Most programs do not need all 2^32 bits of possible virtual address
space to execute. They would have to be 4 gigabytes in size to do that!
Usually they are at most a few megabytes. This implies then that most
of the pagetable contains entries indicate nothing is supposed to be at
a given virtual address. If we needed to supply a pagetable entry (PTE)
for each 4K page in that 4G address space, that is 1Meg entries. Each
PTE takes a long, so you need 4Meg just for the pagetable. Well, the
designers of the Pentium realized that would be silly, as back then,
that's all the memory there was in the whole computer. So there is a
pagetable page for the pagetable pages themselves, called a pagetable
directory page. This means you only have to have a page of pagetable
entries for addresses you are actually using. So if a page of pagetable
entries is all unused, you don't have to supply a pagetable page. Each
entry of the pagetable directory page has a flag saying whether or not
there is a corresponding pageta le page, and if so, the physical
address of the corresponding pagetable page.
The virtual address now breaks down into:
Bits 31..22: index into the pagetable directory page
Bits 21..12: index into the pagetable page pointed to by the directory entry selected by bits 31..22
Bits 11..00: offset in the physical page pointed to by the pagetable entry selected by bits 21..12
The above is Intel specific, but other processors have similar
stuff. Like Alphas have 8K pages, and IIRC, they have 3 levels of
pagetable pages, instead of just 2 (because they have 64-bit virtual
addresses). VAX's do it a completely different way, but the result is
When we say that a particular process is active on a CPU, we are
saying that that process' pagetables are being accessed by that CPU. In
the Pentiums, this is done by loading the physical address of the
pagetable directory in %CR3. So to switch processes on a Pentium, all
one has to do is change the contents of %CR3 and you are now executing
with different pagetables.
One more thing. The Intel hardware requires only that the pagetable
directory and the pagetable pages exist in physical memory. It does not
require that the pagetable stuff actually be mapped to virtual memory,
it will run fine without that. However, it will make your coding
miserable (and inefficient). So you *should* map the pagetables
themselves to virtual addresses. In my OS, I put them at the high end.
So the pagetable directory page for a given process is at that process'
virtual address FFFFExxx. (I map FFFFFxxx to no-access to catch bad
pointers at the high end). The pagetable pages are just below that and
occupy virtual addresses FFCFE000 to FFFFDFFF (4 Meg). Just work out
how to do it with pencil and paper, and in a few short hours of pure
joy you'll have it.
Now we have to fill in our pagetable with
something useful. Let's say someone types in a 'copy' command or
something like that. What happens (in regards to pagetables)?
1) The OS creates a pagetable directory page for the new process
2) The OS creates pagetable pages sufficient to point to the memory pages for the copy executable
3) The OS allocates memory pages for the copy executable
4) The OS reads the copy executable into those pages
5) Additional pagetable entries and memory pages are added for stack and heap memory
Now OS's usually don't do this for the whole image at once, they
typically just set up structures to facilitate this. The reason for
this is you may have a large image (such as MS Word), but you are only
going to use a small part of it this time. So why load in all 10Meg
when you are only going to be using say 1.5Meg? And if we only had 8Meg
of ram available, it wouldn't even fit in there all at the same time!
So what an OS does is:
1) Create the pagetable directory page, with all entries marked as
'not-present', ie, there are no associated pagetable pages yet
2) Create a kernel struct that says where on disk and how big the executable is, and what virtual address it is to be loaded at
3) Create a kernel struct that says how big the stack is and what virtual address it starts at. Same for heap memory.
Now when a CPU tries to execute instructions or data in that image,
the CPU will not be able to access it, because the pagetable directory
contains a null entry indicating the instructions or data are not in
memory. So what the CPU does is generate a Page Fault exception. This
passes the virtual address of the attempted access, and whether it was
a read or a write, and whether it was by kernel mode or user mode, to a
Page Fault Exception handler, which (usually) is a kernel mode routine
that you are going to supply. A prototype for the handler might look
int pagefault_exception_handler (void *virtual_address,
virtual_address = address that the program was trying to access but couldn't because either:
1) the directory indicates there is no pagetable page present
2) the pagetable entry indicates no physical page is allocated
3) the pagetable entry indicates that the access attempt is illegal (eg, writing to a page that is marked read-only)
writing = 0: the program is attempting to read the memory location
1: the program is attempting to write to the memory location
usermode = 0: the routine was executing in kernel mode
1: the routine was executing in user mode
pagefault_exception_handler = SUCCESS : the page was faulted in and the instruction should be retried
else : unable to fix it, signal a pagefault exception
So the pagefault handler routine looks at the kernel structs that
were created to determine whether the access was legitimate, ie, is
there supposed to be something there or not. If not, typically, it
signals the exception (like a 'segment violation' in Linux). If there
is supposed to be something there, that's when it allocates a pagetable
page (if there isn't one there already), and allocates the page for the
instructions or data.
Now it needs to look at the struct to see what goes into the page.
If the struct says it is part of the executable, then the page is read
in from disk. If it is a stack or heap page, it is filled with zeroes
(or some other fill pattern).
Once the page is all set up, the pagefault handler returns back to
the instruction that caused the fault. The CPU re-executes the
instruction, and this time, it succeeds, because the pagetable entry is
all filled in.
So the things the hardware requires us to tell it in a pagetable entry are:
|whether or not the page is in memory
|if so, what physical page it is in
|what kind of access is currently allowed (read, write, user, kernel)
Now, it turns out that your OS may find it handy to save some other
info in there, too. Fortunately, the hardware usually leaves unused
bits in the pagetable entries just for this very purpose. If it
doesn't, you'll have to make an array that parallels the pagetable
array to store these extra bits.
So here are additional things I store in the pagetable entries:
state (valid and readable, valid and readable but dirty, valid and
writable and dirty, page being read in from disk, page being written
out to disk, page read/write from/to disk failed, page faulted out)
|what the requested protection is (read, write, user, kernel)
Differs from current protection. Current protection should be
'no-access' when the page is faulted out, read-only when the page is
being written out to disk, etc. Requested protection is what the page
should be when it is fully realized.
Here are some low-level routines you will probably need for accessing pagetable entries:
read_pte : given a virtual address, determine if the corresponding pagetable entry exists in memory for the current process.
If not, return the virtual address of the missing pagetable entry
(this will allow the caller to fault in the pagetable page then retry)
If it is in memory, return:
whether or not there is a memory page associated with it
if so, what physical page is there
current access modes
software page state
requested access modes
Now in my OS, my kernel's pagetable pages are always in memory, so
I provide another routine, read_ptesys, that panics if the entry is not
in memory, so I don't have to check the return value.
write_pte: given a virtual address, write the corresponding pagetable entry
For my OS, this routine could assume the pagetable page was in
memory, as I always attempted to read the pagetable entry before
You may wonder if the CPU actually goes out and
reads the pagetable entry for every memory access so it can translate
the virtual address to the physical address. Well, they don't. CPU's
typically have a special cache where they save the most recently used
pagetable entries so they don't have to constantly go out and read
These caches, typically, are somewhat 'manually operated'. The CPU
will automatically load entries into the cache, but if you change a
pagetable entry, it won't unload or update the corresponding cache
entry. They do provide an instruction to unload it, though. So this
just means your 'write_pte' subroutine has to also invoke this
instruction after it sets up the new pagetable entry. For Intel chips,
the magic instruction is INVLPG. The cache also gets completely flushed
when you change which pagetable you are using, ie, you write %CR3.
Now if you are running on an multiprocessor
system, things can get a little messy. Let's say two (or more) CPU's
are executing different threads in the same process. Suppose one of the
CPU's does something like an 'munmap' call that removes pages from the
virtual address space and frees the physical pages for use by other
users. So it executes INVLPG instructions for itself to remove those
bad cache entries. It also has to tell any of the other CPU's that are
mapped to this pagetable to remove their bad cache entries, too!
Typically, this is done (in Intel processors), by setting a value in
memory indicating which virtual address to invalidate, then issuing an
NMI to the other processors, and then waiting for them to acknowledge
that they have done the invalidate.
Now there is a little trick you can use to minimize NMI-ing
everytime you write an pte. Suppose, again, we have two CPU's executing
different threads in the same process. And one of them gets a pagefault
because part of the executable needs reading in. That pagetable entry
will be 'no-access' because there is nothing there in memory. So the
faulting CPU will read the page in from disk and set up the new
pagetable entry, and issue an INVLPG to itself. Now it doesn't have to
NMI the other CPU's. If the other CPU attempts to access that same
page, the worst that will happen is it will get a pagefault too. Then
your pagefault handler will see that some other CPU faulted the page in
and it's ok, so it just does an INVLPG on itself and returns. So it
turns out the only time you need to do the NMI thing is when you are
'downgrading' the protection on a page, ie, you are disallowing some
type of access that was previously allowed. Examples are changing a
page from read/write to read-only, or marking an entry as non-existant.
So my OS actually has two pte write routines, pte_writecur which
invalidates only the calling CPU's cache entry, and pte_writeall which
invalidates all CPU's cache entries. I must call pte_writeall for
downgrades in protection, pte_writecur can be called for upgrades only.