Part III: Cluster Theory and Practice

List of Topics

Chapter 10: How to Build a Linux Enterprise Cluster
Chapter 11: The Linux Virtual Server: Introduction and Theory
Chapter 12: The LVS-NAT Cluster
Chapter 13: The LVS-DR Cluster
Chapter 14: The Load Balancer
Chapter 15: The High-Availability Cluster
Chapter 16: The Network File System

Chapter 10: How to Build a Linux Enterprise Cluster

So far this book has focused on how to build a highly available pair of servers that can support mission-critical applications with no single point of failure. In this part of the book, we will look at how to build a cluster of servers that are all capable of running the same services to support the end users.

A cluster is more than a highly available pair of servers, because all of the nodes in the cluster can share the processing load. When one node goes down, only the users who were logged on to that node are affected, and they can simply log back on again to reconnect to a working cluster node.

Steps for Building a Linux Enterprise Cluster

To build a Linux Enterprise Cluster, you need to do several things, each of which is outlined in this chapter:

  • Decide which NAS server you will use.

  • Understand the basic concepts of the Kernel Netfilter and kernel packet routing.

  • Learn how to clone a Linux machine.

  • Decide on a naming scheme for your cluster.

  • Learn how to apply system configuration changes to all cluster nodes.

  • Build a Linux Virtual Server Network Address Translation (LVS-NAT) cluster that uses a separate physical network for the cluster.

  • Build an LVS Direct Routing (LVS-DR) cluster.

  • Install software to automatically remove failed cluster nodes.

  • Install software to monitor the cluster.

  • Learn how to monitor the performance of the cluster nodes.

  • Learn how to update software packages on your cluster nodes and servers using an automated tool.

  • Decide which method you will use to centralize user account administration.

  • Install a printing system that supports the cluster.

  • Install a highly available batch job-scheduling system.

  • Purchase the cluster nodes.

NAS Server

If your cluster will run legacy mission-critical applications that rely on normal Unix lock arbitration methods (discussed in detail in Chapter 16) the Network Attached Storage (NAS) server will be the most important performance bottleneck in your cluster, because all filesystem I/O operations will pass through the NAS server.

You can build your own highly available NAS server on inexpensive hardware using the techniques described in this book,[1] but an enterprise-class cluster should be built on top of a NAS device that can commit write operations to a nonvolatile RAM (NVRAM) cache while guaranteeing the integrity of the cache even if the NAS system crashes or the power fails before the write operation has been committed to disk.


For testing purposes, you can use a Linux server as a NAS server. (See Chapter 16 for a discussion of asynchronous Network File System (NFS) operations and why you can normally use asynchronous NFS only in a test environment and not in production.)

Kernel Netfilter and Kernel Packet Routing

Before you can build a cluster load balancer, you'll need to understand how you can alter the fate of network packets as they pass through the Linux kernel. This ability allows you to build a cluster load balancer that distributes incoming requests for services across all cluster nodes. The command-line tools used to alter the fate of a packet are iptables, route, and the ip utility. You should be familiar with these tools before using them to build an enterprise-class cluster. (See Chapter 2 for more information on these tools.)

Cloning a Linux Machine

Chapters 4 and 5 describe a method for cloning a Linux machine using the SystemImager package. It would not be practical to build a cluster of nodes that are all configured the same way without using system-cloning software.

Cluster Naming Scheme

To automate the cloning process, each node should begin with a common string of characters followed by a sequential cluster-node number. For example, the first cluster-node host name could be clnode1, the second could be clnode2, and so forth.
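As a rough sketch of this naming scheme, the host names can be generated from a common prefix and a sequential node number. The function name node_names and its parameters are hypothetical and used only for illustration; they are not part of SystemImager or any other package discussed in this book.

```python
def node_names(count, prefix="clnode", start=1):
    """Return sequential cluster-node host names: clnode1, clnode2, ..."""
    if count < 0:
        raise ValueError("count must be non-negative")
    # The prefix is shared by every node; only the trailing number changes.
    return [f"{prefix}{number}" for number in range(start, start + count)]


for name in node_names(3):
    print(name)
```

A predictable prefix-plus-number pattern is what lets a script derive every host name from nothing more than the node count.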

Applying System Configuration Changes to all Nodes

A cluster administrator needs to know how to automatically apply system administration changes to all cluster nodes. This can be done using the SystemImager package's updateclient command. Before the cluster goes into production, practice using the updateclient command to apply changes made on the Golden Client to all of the cluster nodes (see Chapter 5 for more information).

Building an LVS-NAT Cluster

Building a Linux Virtual Server Network Address Translation (LVS-NAT) cluster will help you understand how the Linux Virtual Server (LVS) software works. It will also help you ensure that your load balancer is in charge of allocating nodes inside the cluster for inbound connection requests. (See Chapter 11 for an introduction to load balancing and Chapter 12 for an introduction to LVS-NAT.)

Building an LVS-DR Cluster

Once you know how to build an LVS-NAT cluster, you are ready to convert it to a Linux Virtual Server Direct Routing (LVS-DR) cluster, as described in Chapter 13. An enterprise-class cluster that is based on LVS-DR is superior to an LVS-NAT cluster for mission-critical applications for several reasons:

  • The LVS-DR cluster is easier to administer. LVS-DR cluster nodes can be administered from outside the cluster network using telnet and ssh to make a connection to the cluster nodes. In an LVS-NAT cluster, the physical network cabling or VLAN configuration prevents you from making direct connections to the cluster nodes.

  • The LVS-DR cluster can send replies from the cluster nodes back to the client computers without passing the packets through an intermediate machine (the load balancer).

  • The LVS-DR cluster load balancer can malfunction and not render all of the cluster nodes useless. In contrast, if the primary and backup LVS-NAT cluster load balancers both crash at the same time, the entire LVS-NAT cluster is down. In an LVS-DR cluster, if the primary and backup LVS load balancers crash at the same time, the cluster nodes can still be used as separate or distributed servers. (See Chapter 13 for the details of how to build an LVS-DR cluster.) In practice, however, this is a selling point to management and not a "feature" of an LVS-DR cluster.


The Linux Enterprise Cluster should be protected from an outside attack by a firewall. If you do not protect your cluster nodes from attack with a firewall, or if your cluster must be connected to the Internet, shell access to the cluster nodes should be physically restricted. You can physically restrict shell access to the cluster nodes by building a separate network (or VLAN) that connects the cluster nodes to an administrative machine. This separate physical network is sometimes called an administrative network.[2]

Installing Software to Remove Failed Cluster Nodes

In this book we will use the ldirectord software package (included on the CD-ROM) to automatically remove nodes from the cluster when they fail. Chapter 15 will describe how to install and configure ldirectord.

As cluster administrator, you will also want to know how to manually remove a node from the cluster for maintenance purposes without affecting users currently logged on to the system. We'll look at how to do this in Chapter 19.

Installing Software to Monitor the Cluster Nodes

You cannot wade through the log files on every cluster node every day. You need monitoring software capable of sending email messages, electronic pages, or text messages to the administrator when something goes wrong on one of the cluster nodes.

Many open source packages can accomplish this task, and Chapter 17 describes a method of doing this using the Simple Network Management Protocol (SNMP) and the Mon software package. SNMP and the Mon monitoring package together allow you to monitor the cluster nodes and send an alert when a threshold that you have specified is violated.

Monitoring the Performance of Cluster Nodes

In addition to monitoring the cluster nodes for problems, you will also want to be able to monitor the processing load on each node to see if they are balancing the workload properly. The Ganglia package is an excellent tool for doing this. Chapter 18 will describe how to use Ganglia and will discuss a few of the performance metrics (such as the system load average) that Ganglia collects from all of the cluster nodes.

Managers and operations staff can use web pages created with the Ganglia software package to watch the processing load on the cluster in real time. The day your cluster goes into production, this will be one of the most important tools you have to see what is going on inside the cluster.

Updating Software on Cluster Nodes and Servers

You will also need a method of automatically downloading and installing packages to fix security holes as they are found and plugged. The automated tool Yum, described in Appendix D, is one way of doing this. You'll want to learn how to use Yum or another of the many automated package update utilities before going into production to make sure you have built a system that can continue to evolve and adapt to change.

You'll also need to take into account subtle problems, like the fact that the SystemImager updateclient command may overwrite the package registration information or the list of software (RPM) packages stored on the cluster node's disk drive. (To resolve this problem, you may want to install software on one node—the SystemImager Golden Client—and then just use the updateclient command to upgrade the remaining cluster nodes.)

Centralizing User Account Administration

You will need to somehow centralize the administration of user accounts on your cluster. See Chapters 1 and 19 for discussions of the possible account-administration methods, such as NIS, LDAP, Webmin, OPIUM (part of the OSCAR package), or a cron job that copies information stored on a central server to the local /etc/passwd file on each cluster node. (Whichever method you select, you will also want to decide if this is the right method for centralizing group and host information as well.)

Installing a Printing System

You will need to set up a printing system that allows you to have a single point of control for all print jobs from the cluster without creating a single point of failure. The use of the LPRng package for this purpose in a cluster environment will be briefly discussed in Chapter 19.

Installing a Highly Available Batch Job-Scheduling System

Building a highly available cluster will not improve the reliability of your system if your batch job scheduling system is not highly available. We'll look at how to build a batch job scheduling system with no single point of failure in Chapter 18.

Purchasing the Cluster Nodes

Clusters built to support scientific research can sometimes contain thousands of nodes and fill the data centers of large research institutions. For the applications that run on these clusters, there may be no practical limit to the number of CPU cycles they can use: as more cycles become available, more work gets done. By contrast, an enterprise cluster has a practical upper limit on the number of processing cycles that an application can use. An enterprise workload will have periods of peak demand, where processing needs may be double, triple, or more, compared to processing needs during periods of low system activity. However, at some point more processing power does not translate into more work getting done, because external factors (such as the number and ability of the people using the cluster) determine this limit.

A Linux Enterprise Cluster will therefore have an optimal number of nodes, determined by the requirements of the organization using it, and by the costs of building, maintaining, and supporting it. In this section, we'll look at two basic design considerations for finding the optimal number of nodes: cluster performance and the impact of a single node failure.

Capacity Planning and Cluster Performance

Applications spend most of their idle time in one of four states: waiting for user input, waiting for the CPU, waiting for filesystem I/O, or waiting for network I/O. When you build a cluster, you greatly reduce only one of these: the amount of time applications spend waiting for the CPU. By spreading users out over several cluster nodes, you also reduce the likelihood of several CPU-bound processes competing for the same CPU at the same time.

Most organizations can easily afford to purchase enough cluster nodes to eliminate the CPU as the performance bottleneck, and building a cluster doesn't help you prevent the other three performance bottlenecks. Therefore, the second cluster design consideration (the impact of a single node failure) is likely to influence capacity planning in most organizations more than performance.

Capacity Planning and the Impact of a Single Node Failure

The second and more significant design consideration for most organizations is the business impact of a node failure, and the ability to perform routine (planned) maintenance. When you are deciding how many nodes to purchase, the business impact of a single node failure on the enterprise or on the user community may be the single most important design consideration.

Purchasing more nodes than the number needed for peak CPU processing may make sense because the extra nodes will reduce the impact of the failure of a single node on the end-user community, and it will also make cluster maintenance easier. (The cluster administrator can remove a node from the cluster for maintenance, and the cluster will still have enough processing power to continue to get the job done.)

Assuming your budget will allow you to purchase more nodes than you need to adequately meet your CPU processing requirements, you'll need to examine how your workload will be distributed across the cluster nodes and determine the business impact of a node failure (for example, how many users would be affected or how many user sessions would be affected). You can then provide management with the total cost of ownership for each additional cluster node and explain the benefits of purchasing additional nodes.
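The trade-off described above can be put into numbers with a small calculation. This is a minimal sketch, assuming users are spread evenly across the nodes and that each node can carry a fixed number of users; the function names and every figure in it are hypothetical.

```python
import math


def nodes_for_peak(peak_users, users_per_node):
    """Smallest number of nodes whose combined capacity covers peak demand."""
    return math.ceil(peak_users / users_per_node)


def users_affected_by_failure(total_users, node_count):
    """Users who lose their sessions when one node fails, assuming the
    workload is spread evenly across all nodes."""
    return math.ceil(total_users / node_count)


def survives_one_failure(peak_users, users_per_node, node_count):
    """True if the remaining nodes can still carry peak demand."""
    return (node_count - 1) * users_per_node >= peak_users


# Hypothetical figures: 300 users at peak, 100 users per node.
minimum = nodes_for_peak(300, 100)
for node_count in (minimum, minimum + 1):
    print(node_count,
          users_affected_by_failure(300, node_count),
          survives_one_failure(300, 100, node_count))
```

With these made-up numbers, three nodes cover peak demand but cannot absorb a failure, while a fourth node cuts the number of affected users from 100 to 75 and leaves enough capacity to take one node out of service.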

[1]Techniques for building high-availability servers are described in this book, but building an NFS server is outside the scope of this book; however, in Chapter 4 there is a brief description of synchronizing data on a backup NAS server. (Another method of synchronizing data using open source software is the drbd project.)

[2]See the discussion of administrative networks in Evan Marcus and Hal Stern's Blueprints for High Availability (John Wiley and Sons, 2003).

[3]CPU and memory requirements, disk I/O, network bandwidth, and so on.

[4]Note that this will not work if your software applications have to contend with endian conversion issues. Sparc hardware, for example, uses big-endian byte ordering, while Intel hardware uses little-endian byte ordering. To learn more about endian conversion issues and Linux, see Gulliver's Travels or IBM's "Solaris-to-Linux porting guide."

In Conclusion

Building a cluster allows you to remove the CPU performance bottleneck and improve the reliability and availability of your user applications. With careful testing and planning, you can build a Linux Enterprise Cluster that can run mission-critical applications with no single point of failure. And with the proper hardware capacity (that is, the right number of nodes) you can maintain your cluster during normal business hours without affecting the performance of end-user applications.

Chapter 11: The Linux Virtual Server: Introduction and Theory


This chapter will introduce the cluster load-balancing software called IP Virtual Server (IPVS). The IPVS software is a collection of kernel patches that were merged into the stock version of the Linux kernel starting with version 2.4.23. When combined with the kernel's routing and packet-filtering capabilities (discussed in Chapter 2) the IPVS-enabled kernel lets you turn any computer running Linux into a cluster load balancer. Together, the IPVS-enabled cluster load balancer and the cluster nodes are called a Linux Virtual Server (LVS).

The LVS cluster load balancer accepts all incoming client computer requests for services and decides which cluster node should reply to each request. The load balancer is sometimes called an LVS Director or simply a Director. In this book the terms LVS Director, Director, and load balancer all refer to the same thing.

The nodes inside an LVS cluster are called real servers, and the computers that connect to the cluster to request its services are called client computers. The client computers, the Director, and the real servers communicate with each other using IP addresses the same way computers have always exchanged packets over a network; however, to make it easier to discuss this network communication, the LVS community has developed a naming convention to describe each type of IP address based on its role in the network conversation. So before we consider the different types of LVS clusters and the choices you have for distributing your workload across the cluster nodes (called scheduling methods), let's look at this naming convention and see how it helps describe the LVS cluster.

LVS IP Address Name Conventions

In an LVS cluster, we cannot refer to network addresses as simply "IP addresses." Instead, we must distinguish between different types of IP addresses based on the roles of the nodes inside the cluster. Here are four basic types of IP addresses used in a cluster:

Virtual IP (VIP) address

  • The IP address the Director uses to offer services to client computers

Real IP (RIP) address

  • The IP address used on the cluster nodes

Director's IP (DIP) address

  • The IP address the Director uses to connect to the D/RIP network

Client computer's IP (CIP) address

  • The IP address assigned to a client computer that it uses as a source IP address for requests sent to the cluster

The Virtual IP (VIP)

The IP addresses that client computers use to connect to the services offered by the cluster are called virtual IP addresses (VIPs). VIPs are IP aliases or secondary IP addresses on the NIC that connects the Director to the normal, public network.[1] The LVS VIP is important because it is the address that client computers will use when they connect to the cluster. Client computers send packets from their IP address to the VIP address to access cluster services. You tell the client computers the VIP address using a naming service (such as DNS, DDNS, WINS, LDAP, or NIS), and this is the only name or address that client computers ever need to know in order to use the services inside the cluster. (The remaining IP addresses inside the cluster are not known to the client computer.)

A single Director can have multiple VIPs offering different services to client computers, and the VIPs can be public IP addresses that can be routed on the Internet, though this is not required. What is required, however, is that the client computers be able to access the VIP or VIPs of the cluster. (As we'll see later, an LVS-NAT cluster can use a private intranet IP address for the nodes inside the cluster, even though the VIP on the Director is a public Internet IP address.)

The Real IP (RIP)

In LVS terms, a node offering services to the outside world is called a real server. (We will use the terms cluster node and real server interchangeably throughout this book.) The IP address used on the real server is therefore called a real IP address (RIP).

The RIP address is the IP address that is permanently assigned to the NIC that connects the real server to the same network as the Director. We'll call this network the cluster network or the Director/real-server network (D/RIP network). The Director uses the RIP address for normal network communication with the real servers on the D/RIP network, but only the Director needs to know how to talk to this IP address.

The Director's IP (DIP)

The Director's IP (DIP) address is used on the NIC that connects the Director to the D/RIP network. As requests for cluster services are received on the Director's VIP, they are forwarded out the DIP to reach a cluster node. As is discussed in Chapter 15, the DIP and the VIP can be on the same NIC.

The Client Computer's IP (CIP)

The client computer's IP (CIP) address may be a local, private IP address on the same network as the VIP, or it may be a public IP address on the Internet.

IP Addresses in an LVS Cluster

A schematic of an LVS cluster containing one cluster node (one real server) and one Director is shown in Figure 11-1.

Figure 11-1: LVS cluster schematic

In this figure, the Director's public NIC, using the VIP address, is connected to the company network switch, or this could be an Internet router. The NIC connected to the D/RIP network has the DIP address on it. Figure 11-1 shows a D/RIP network hub or switch, though as we'll see, some cluster configurations can use the same network switch for both the public NIC and the cluster NIC (in LVS-DR clusters). Finally, the NIC that connects the real server to the D/RIP network is shown with a RIP address. Incoming packets destined for the real server arrive on the VIP, pass through the Director and out its DIP, and finally reach the real server's RIP.
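The four address roles in this schematic can be written down as a small data structure, which may help when reading the later discussion of forwarding methods. This is only an illustrative sketch: the class name LvsAddresses, the method inbound_path, and all of the example addresses are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class LvsAddresses:
    """The four address roles in a one-Director, one-real-server cluster."""
    cip: IPv4Address  # client computer's IP
    vip: IPv4Address  # virtual IP on the Director's public NIC
    dip: IPv4Address  # Director's IP on the D/RIP network
    rip: IPv4Address  # real IP on the real server's NIC

    def inbound_path(self):
        """Addresses an incoming request passes, in order."""
        return [("CIP", self.cip), ("VIP", self.vip),
                ("DIP", self.dip), ("RIP", self.rip)]


# Hypothetical example addresses.
example = LvsAddresses(
    cip=IPv4Address("198.51.100.7"),
    vip=IPv4Address("203.0.113.10"),
    dip=IPv4Address("10.1.1.1"),
    rip=IPv4Address("10.1.1.2"),
)
print(" -> ".join(f"{role} {address}" for role, address in example.inbound_path()))
```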

[1]As you'll see in Chapter 15, the VIPs in a high-availability configuration are under Heartbeat's control.

Types of LVS Clusters

Now that we've looked at some of the IP address name conventions used to describe LVS clusters, let's examine the LVS packet-forwarding methods.

LVS clusters are usually described by the type of forwarding method the LVS Director uses to relay incoming requests to the nodes inside the cluster. Three methods are currently available:

  • Network address translation (LVS-NAT)

  • Direct routing (LVS-DR)

  • IP tunneling (LVS-TUN)

Although more than one forwarding method can be used on a single Director (the forwarding method can be chosen on a per-node basis), I'll simplify this discussion and describe LVS clusters as if the Director is only capable of using one forwarding method at a time.

The best forwarding method to use with a Linux Enterprise Cluster is LVS-DR (and the reasons for this will be explained shortly), but an LVS-NAT cluster is the easiest to build. If you have never built an LVS cluster and want to use one to run your enterprise, you may want to start by building a small LVS-NAT cluster in a lab environment using the instructions in Chapter 12, and then learn how to convert this cluster into an LVS-DR cluster as described in Chapter 13. The LVS-TUN cluster is not generally used for mission-critical applications and is mentioned in this chapter only for the sake of completeness. It will not be described in detail.

Network Address Translation (LVS-NAT)

In an LVS-NAT configuration, the Director uses the Linux kernel's ability (from the kernel's Netfilter code) to translate network IP addresses and ports as packets pass through the kernel. (This is called Network Address Translation (NAT), and it was introduced in Chapter 2.)


We'll examine the LVS-NAT network communication in more detail in Chapter 12.

As shown in Figure 11-2, a request for a cluster service is received by the Director on its VIP, and the Director forwards this request to a cluster node on its RIP. The cluster node then replies to the request by sending the packet back through the Director so the Director can perform the translation that is necessary to convert the cluster node's RIP address into the VIP address that is owned by the Director. This makes it appear to client computers outside the cluster as if all packets are sent and received from a single IP address (the VIP).

Figure 11-2: LVS-NAT network communication
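The translation in both directions can be modeled in a few lines. This is a simplified sketch, not the kernel's implementation: it assumes a single real server and a fixed mapping, whereas a real Director keeps track of each connection. The Packet class, the function names, and all addresses and ports are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical addresses and ports chosen only for illustration.
VIP, VIP_PORT = "203.0.113.10", 80
RIP, RIP_PORT = "10.1.1.2", 8080


def director_inbound(packet):
    """Rewrite a client request addressed to the VIP so it reaches the RIP."""
    if (packet.dst_ip, packet.dst_port) != (VIP, VIP_PORT):
        return packet  # not a request for the cluster service
    return replace(packet, dst_ip=RIP, dst_port=RIP_PORT)


def director_outbound(packet):
    """Rewrite a real server's reply so it appears to come from the VIP."""
    if (packet.src_ip, packet.src_port) != (RIP, RIP_PORT):
        return packet  # not a reply from the real server
    return replace(packet, src_ip=VIP, src_port=VIP_PORT)


request = Packet("198.51.100.7", 40000, VIP, VIP_PORT)
forwarded = director_inbound(request)
reply = Packet(RIP, RIP_PORT, request.src_ip, request.src_port)
print(forwarded)
print(director_outbound(reply))
```

Because both rewrites happen on the Director, the client only ever sees the VIP as the other end of the conversation.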

Basic Properties of LVS-NAT

The LVS-NAT forwarding method has several basic properties:

  • The cluster nodes need to be on the same network (VLAN or subnet) as the Director.

  • The RIP addresses of the cluster nodes normally conform to RFC 1918[2] (that is, they are private, non-routable IP addresses used only for intracluster communication).

  • The Director intercepts all communication (network packets going in either direction) between the client computers and the real servers.

  • The cluster nodes use the Director's DIP as their default gateway for reply packets to the client computers.

  • The Director can remap network port numbers. That is, a request received on the Director's VIP on one port can be sent to a RIP inside the cluster on a different port.

  • Any type of operating system can be used on the nodes inside the cluster.

  • A single Director can become the bottleneck for the cluster.

At some point, the Director will become a bottleneck for network traffic as the number of nodes in the cluster increases, because all of the reply packets from the cluster nodes must pass through the Director. However, a 400 MHz processor can saturate a 100 Mbps connection, so the network is more likely to become the bottleneck than the LVS Director under normal circumstances.

The LVS-NAT cluster is more difficult to administer than an LVS-DR cluster because the cluster administrator sitting at a computer outside the cluster is blocked from direct access to the cluster nodes, just like all other clients. When attempting to administer the cluster from outside, the administrator must first log on to the Director before being able to telnet or ssh to a specific cluster node. If the cluster is connected to the Internet, and client computers use a web browser to connect to the cluster, having the administrator log on to the Director may be a desirable security feature of the cluster, because an administrative network can be used to allow only internal IP addresses shell access to the cluster nodes. However, in a Linux Enterprise Cluster that is protected behind a firewall, you can more easily administer cluster nodes when you can connect directly to them from outside the cluster. (As we'll see in Part IV of this book, the cluster node manager in an LVS-DR cluster can sit outside the cluster and use the Mon and Ganglia packages to gain diagnostic information about the cluster remotely.)
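The RFC 1918 property in the list above can be checked programmatically. The sketch below uses Python's standard ipaddress module and tests an address against the three private ranges that RFC 1918 defines; the function name is_rfc1918 is hypothetical.

```python
from ipaddress import ip_address, ip_network

# The three private address blocks defined by RFC 1918.
RFC1918_NETWORKS = (
    ip_network("10.0.0.0/8"),
    ip_network("172.16.0.0/12"),
    ip_network("192.168.0.0/16"),
)


def is_rfc1918(address):
    """Return True if the address falls inside one of the RFC 1918 blocks."""
    addr = ip_address(address)
    return any(addr in network for network in RFC1918_NETWORKS)


for candidate in ("10.1.1.2", "172.32.0.1", "192.168.1.20"):
    print(candidate, is_rfc1918(candidate))
```

The explicit list is used instead of the module's is_private attribute because that attribute also covers other reserved ranges, such as loopback addresses.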

Direct Routing (LVS-DR)

In an LVS-DR configuration, the Director forwards all incoming requests to the nodes inside the cluster, but the nodes inside the cluster send their replies directly back to the client computers (the replies do not go back through the Director).[3] As shown in Figure 11-3, the request from the client computer or CIP is sent to the Director's VIP. The Director then forwards the request to a cluster node or real server using the same VIP destination IP address (we'll see how the Director does this in Chapter 13). The cluster node then sends a reply packet directly to the client computer, and this reply packet uses the VIP as its source IP address. The client computer is thus fooled into thinking it is talking to a single computer, when in reality it is sending request packets to one computer and receiving reply packets from another.

Figure 11-3: LVS-DR network communication
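The difference from LVS-NAT shows up clearly in a small model. This chapter defers the forwarding mechanism to Chapter 13; the sketch below assumes the commonly documented LVS-DR behavior, in which the Director readdresses the frame to the real server's MAC address and leaves the IP header untouched. The Frame class, the function names, and all addresses are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    src_ip: str
    dst_ip: str


# Hypothetical addresses chosen only for illustration.
CIP = "198.51.100.7"
VIP = "203.0.113.10"
DIRECTOR_MAC = "02:00:00:00:00:01"
REAL_SERVER_MAC = "02:00:00:00:00:02"
GATEWAY_MAC = "02:00:00:00:00:fe"  # router leading back to the client


def director_forward(frame):
    """Hand the request to the real server without touching the IP header."""
    return replace(frame, dst_mac=REAL_SERVER_MAC)


def real_server_reply(request):
    """Reply straight to the client, using the VIP as the source address."""
    return Frame(dst_mac=GATEWAY_MAC,
                 src_ip=request.dst_ip,
                 dst_ip=request.src_ip)


request = Frame(dst_mac=DIRECTOR_MAC, src_ip=CIP, dst_ip=VIP)
forwarded = director_forward(request)
reply = real_server_reply(forwarded)
print(forwarded)
print(reply)
```

Note that the reply never carries the Director's MAC address: it leaves the real server through its own gateway, which is why the Director sees only the inbound half of the conversation.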

Basic Properties of LVS-DR

These are the basic properties of a cluster with a Director that uses the LVS-DR forwarding method:

  • The cluster nodes must be on the same network segment as the Director.[4]

  • The RIP addresses of the cluster nodes do not need to be private IP addresses (which means they do not need to conform to RFC 1918).

  • The Director intercepts inbound (but not outbound) communication between the client and the real servers.

  • The cluster nodes (normally) do not use the Director as their default gateway for reply packets to the client computers.

  • The Director cannot remap network port numbers.

  • Most operating systems can be used on the real servers inside the cluster.[5]

  • An LVS-DR Director can handle more real servers than an LVS-NAT Director.

Although the LVS-DR Director can't remap network port numbers the way an LVS-NAT Director can, and only certain operating systems can be used on the real servers when LVS-DR is used as the forwarding method,[6] LVS-DR is the best forwarding method to use in a Linux Enterprise Cluster because it allows you to build cluster nodes that can be directly accessed from outside the cluster. Although this may represent a security concern in some environments (a concern that can be addressed with a proper VLAN configuration), it provides additional benefits that can improve the reliability of the cluster and that may not be obvious at first:

  • If the Director fails, the cluster nodes become distributed servers, each with their own IP address. (Client computers on the internal network, in other words, can connect directly to the LVS-DR cluster node using their RIP addresses.) You would then tell users which cluster-node RIP address to use, or you could employ a simple round-robin DNS configuration to hand out the RIP addresses for each cluster node until the Director is operational again.[7] You are protected, in other words, from a catastrophic failure of the Director and even of the LVS technology itself.[8]

  • To test the health and measure the performance of each cluster node, monitoring tools can be used on a cluster node manager that sits outside the cluster (we'll discuss how to do this using the Mon and Ganglia packages in Part IV of this book).

  • To quickly diagnose the health of a node, irrespective of the health of the LVS technology or the Director, you can telnet, ping, and ssh directly to any cluster node when a problem occurs.

  • When troubleshooting what appear to be software application problems, you can tell end-users[9] how to connect to two different cluster nodes directly by IP (RIP) address. You can then have the end-user perform the same task on each node, and you'll know very quickly whether the problem is with the application program or one of the cluster nodes.
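The round-robin fallback mentioned in the first item above amounts to handing out the cluster-node RIP addresses in rotation. The sketch below models only that rotation, not DNS itself; the function name round_robin and the addresses are hypothetical.

```python
from itertools import cycle


def round_robin(rips):
    """Return an iterator that yields the RIP addresses in rotation, the way
    a simple round-robin configuration hands out one address after another."""
    if not rips:
        raise ValueError("at least one RIP address is required")
    return cycle(rips)


# Hypothetical RIP addresses of three cluster nodes.
picker = round_robin(["10.1.1.2", "10.1.1.3", "10.1.1.4"])
for _ in range(5):
    print(next(picker))
```

Rotation spreads new connections across the nodes but, unlike the Director, it knows nothing about node health or load, which is why it serves only as a stopgap until the Director is operational again.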


In an LVS-DR cluster, packet filtering or firewall rules can be installed on each cluster node for added security. See the LVS-HOWTO for a discussion of security issues and LVS. In this book we assume that the Linux Enterprise Cluster is protected by a firewall and that only client computers on the trusted network can access the Director and the real servers.

IP Tunneling (LVS-TUN)

IP tunneling can be used to forward packets from one subnet or virtual LAN (VLAN) to another subnet or VLAN even when the packets must pass through another network or the Internet. Building on the IP tunneling capability that is part of the Linux kernel, the LVS-TUN forwarding method allows you to place cluster nodes on a cluster network that is not on the same network segment as the Director.


We will not use the LVS-TUN forwarding method in any recipes in this book, and it is only included here for the sake of completeness.

The LVS-TUN configuration enhances the capability of the LVS-DR method of packet forwarding by encapsulating inbound requests for cluster services from client computers so that they can be forwarded to cluster nodes that are not on the same physical network segment as the Director. For example, a packet is placed inside another packet so that it can be sent across the Internet (the inner packet becomes the data payload of the outer packet). Any server that knows how to separate these packets, no matter where it is on your intranet or the Internet, can be a node in the cluster, as shown in Figure 11-4.[10]

Figure 11-4: LVS-TUN network communication

The arrow connecting the Director and the cluster node in Figure 11-4 shows an encapsulated packet (one stored within another packet) as it passes from the Director to the cluster node. This packet can pass through any network, including the Internet, as it travels from the Director to the cluster node.
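The encapsulation described above can be sketched in a few lines. This is only an illustration of the idea that the inner packet becomes the payload of the outer packet; it is not the kernel's IP tunneling code, and the `Packet` type, the function names, and all of the addresses (RFC 5737 documentation addresses) are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Packet:
    src: str                          # source IP address
    dst: str                          # destination IP address
    payload: Union[bytes, "Packet"]   # application data, or a whole inner packet


def encapsulate(inner: Packet, director_ip: str, node_ip: str) -> Packet:
    """Wrap the client's packet, unchanged, in a packet addressed to the node."""
    return Packet(src=director_ip, dst=node_ip, payload=inner)


def decapsulate(outer: Packet) -> Packet:
    """Recover the original packet on the cluster node."""
    if not isinstance(outer.payload, Packet):
        raise ValueError("not an encapsulated packet")
    return outer.payload


# Hypothetical addresses, not taken from the book.
request = Packet(src="198.51.100.7", dst="203.0.113.10",
                 payload=b"GET / HTTP/1.0\r\n\r\n")
tunnelled = encapsulate(request, director_ip="203.0.113.1", node_ip="192.0.2.20")

print(tunnelled.dst)                      # routers only see the outer header
print(decapsulate(tunnelled) == request)  # the node gets the original back
```

Because the inner packet still carries the client's address as its source and the VIP as its destination, the cluster node can reply to the client directly once it has unwrapped the packet.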

Basic Properties of LVS-TUN

An LVS-TUN cluster has the following properties:

  • The cluster nodes do not need to be on the same physical network segment as the Director.

  • The RIP addresses must not be private IP addresses.

  • The Director can normally only intercept inbound communication between the client and the cluster nodes.

  • The return packets from the real server to the client must not go through the Director. (The default gateway can't be the DIP; it must be a router or another machine separate from the Director.)

  • The Director cannot remap network port numbers.

  • Only operating systems that support the IP tunneling protocol[11] can be servers inside the cluster. (See the comments in the configure-lvs script included with the LVS distribution to find out which operating systems are known to support this protocol.)

We won't use the LVS-TUN forwarding method in this book because we want to build a cluster that is reliable enough to run mission-critical applications, and separating the Director from the cluster nodes only increases the potential for a catastrophic failure of the cluster. Although using geographically dispersed cluster nodes might seem like a shortcut to building a disaster recovery data center, such a configuration doesn't improve the reliability of the cluster, because anything that breaks the connection between the Director and the cluster nodes will drop all client connections to the remote cluster nodes. A Linux Enterprise Cluster must be able to share data with all applications running on all cluster nodes (this is the subject of Chapter 16). Geographically dispersed cluster nodes only decrease the speed and reliability of data sharing.

[2]RFC 1918 reserves the following IP address blocks for private intranets:

  • 10.0.0.0 through 10.255.255.255

  • 172.16.0.0 through 172.31.255.255

  • 192.168.0.0 through 192.168.255.255
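As a quick check of whether an address falls inside one of the RFC 1918 private blocks, something like the following sketch could be used. It relies only on the standard `ipaddress` module; the helper name `is_rfc1918` and the sample addresses are illustrative assumptions.

```python
import ipaddress

# The three private blocks reserved by RFC 1918, in CIDR notation.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address: str) -> bool:
    """Return True if the address lies in one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in block for block in RFC1918_BLOCKS)


for block in RFC1918_BLOCKS:
    print(block.network_address, "through", block.broadcast_address)

# Hypothetical sample addresses.
for sample in ("10.1.1.2", "172.32.0.1", "203.0.113.10"):
    print(sample, is_rfc1918(sample))
```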

[3]Without the special LVS "martian" modification kernel patch applied to the Director, the normal LVS-DR Director will simply drop reply packets if they try to go back out through the Director.

[4]The LVS-DR forwarding method requires this for normal operation. See Chapter 13 for more info on LVS-DR clusters.

[5]The operating system must be capable of configuring the network interface to avoid replying to ARP broadcasts. For more information, see "ARP Broadcasts and the LVS-DR Cluster" in Chapter 13.

[6]The real servers inside an LVS-DR cluster must be able to accept packets destined for the VIP without replying to ARP broadcasts for the VIP (see Chapter 13).

[7]See the "Load Sharing with Heartbeat—Round-Robin DNS" section in Chapter 8 for a discussion of round-robin DNS

[8]This is unlikely to be a problem in a properly built and properly tested cluster configuration. We'll discuss how to build a highly available Director in Chapter 15.

[9]Assuming the client computer's IP address, the VIP and the RIP are all private (RFC 1918) IP addresses

[10]If your cluster needs to communicate over the Internet, you will likely need to encrypt packets before sending them. This can be accomplished with the IPSec protocol (see the FreeS/WAN project for details). Building a cluster that uses IPSec is outside the scope of this book.

[11]Search the Internet for the "Linux 2.4 Advanced Routing HOWTO" for more information about the IP tunneling protocol.

LVS Scheduling Methods

Having discussed three ways to forward packets to the nodes inside the cluster, let's look at how to distribute the workload across the cluster nodes. When the Director receives an incoming request from a client computer to access a cluster service on its VIP, it has to decide which cluster node should get the request. The scheduling methods the Director can use to make this decision fall into two basic categories: fixed scheduling and dynamic scheduling.


When a node needs to be taken out of the cluster for maintenance, you can set its weight to 0 using either a fixed or a dynamic scheduling method. When a cluster node's weight is 0, no new connections will be assigned to it. Maintenance can be performed after all of the users log out normally at the end of the day. We'll discuss cluster maintenance in detail in Chapter 19.

Fixed (or Non-Dynamic) Scheduling Methods

In the case of fixed, or non-dynamic, scheduling methods, the Director selects the cluster node to use for the inbound request without checking to see how many of the previously assigned connections are active. Here is the current list of fixed scheduling methods:

Round-robin (RR)

  • When a new request is received, the Director picks the next server on its list of servers, rotating through them in an endless loop.

Weighted round-robin (WRR)

  • You assign each cluster node a weight or ranking, based on how much processing load it can handle. This weight is then used, along with the round-robin technique, to select the next cluster node to be used when a new request is received, regardless of the number of connections that are still active. A server with a weight of 2 will receive twice the number of new connections as a server with a weight of 1. If you change the weight of a server to 0, no new connections will be allowed to the server (but currently active connections will not be dropped). We'll look at how LVS uses this weight to balance the incoming workload in the "Weighted Least-Connection (WLC)" section of this chapter.

Destination hashing

  • This method always sends requests for the same IP address to the same server in the cluster. Like the locality-based least-connection (LBLC) scheduling method (which will be discussed shortly), this method is useful when the servers inside the cluster are really cache or proxy servers.

Source hashing

  • This method can be used when the Director needs to be sure the reply packets are sent back to the same router or firewall that the requests came from. This scheduling method is normally only used when the Director has more than one physical network connection, so that the Director knows which firewall or router to send the reply packet back through to reach the proper client computer.
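The weighting rule for weighted round-robin can be illustrated with a deliberately naive sketch: each node simply appears in the rotation as many times as its weight, so a weight of 0 removes it from the rotation. This is an assumption made for illustration, not the LVS implementation (the real scheduler may order the picks differently), and the node names and weights are hypothetical.

```python
from collections import Counter
from itertools import cycle, islice


def weighted_rotation(weights):
    """Return one full rotation: each node repeated `weight` times (0 = never)."""
    rotation = []
    for node, weight in weights.items():
        rotation.extend([node] * weight)
    return rotation


# Hypothetical cluster: node3 has weight 0 (taken out for maintenance).
weights = {"node1": 1, "node2": 2, "node3": 0}

# Assign nine new connections by walking the rotation in an endless loop.
picks = list(islice(cycle(weighted_rotation(weights)), 9))
print(Counter(picks))
```

Over any whole number of rotations, node2 receives twice as many new connections as node1, and node3 receives none. The number of connections that are still active plays no part in the decision, which is what makes this a fixed method.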

Dynamic Scheduling Methods

Dynamic scheduling methods give you more control over the incoming workload, with little or no penalty, since they only require a small amount of extra processing load on the Director. When dynamic scheduling methods are used, the Director keeps track of the number of active and inactive connections for each cluster node and uses this information to determine which cluster node to use when a new request arrives for a cluster service. An active connection is a TCP network session that remains open (in the ESTABLISHED state) while the client computer and cluster node are sending data to each other. In a Linux Enterprise Cluster, telnet or ssh sessions remain active as long as the user is logged on.[12]

An inactive connection, on the other hand, is any network connection that is not in the ESTABLISHED state. If a TCP inactivity timeout causes the connection to drop, or if the client computer sends a FIN packet to close the connection, LVS keeps the connection in the IPVS table for a brief period in case subsequent packets for the connection arrive to reestablish the TCP connection. This may happen, for example, when packets are resent due to transmission problems. The Director, in other words, attempts to protect the integrity of the connection between the client computer and the cluster node when there are minor network transmission problems.


This discussion is more theoretical than practical when using telnet or ssh for user sessions in a Linux Enterprise Cluster. The profile of the user applications (the way the CPU, disk, and network are used by each application) varies over time, and user sessions last a long time (hours, if not days). Thus, balancing the incoming workload offers only limited effectiveness when balancing the total workload over time.

As of this writing, the following dynamic scheduling methods are available:

  • Least-connection (LC)

  • Weighted least-connection (WLC)

  • Shortest expected delay (SED)

  • Never queue (NQ)

  • Locality-based least-connection (LBLC)

  • Locality-based least-connection with replication scheduling (LBLCR)

Least-Connection (LC)

With the least-connection scheduling method, when a new request for a service running on one of the cluster nodes arrives at the Director, the Director looks at the number of active and inactive connections to determine which cluster node should get the request.

The mathematical calculation performed by the Director to make this decision is as follows: For each node in the cluster, the Director multiplies the number of active connections the cluster node is currently servicing by 256, and then it adds the number of inactive connections (recently used connections) to arrive at an overhead value for each node. The node with the lowest overhead value wins and is assigned the new incoming request for service.[13]

If the mathematical calculation results in the same overhead value for all of the cluster nodes, the first node found in the IPVS table of cluster nodes is selected.[14]
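The calculation just described can be written out as a short sketch. It follows the text (active connections times 256, plus inactive connections, lowest value wins, first node wins a tie) rather than reproducing the ip_vs_lc.c source; the function names, node names, and connection counts are hypothetical.

```python
def lc_overhead(active, inactive):
    """Overhead value as described in the text: active * 256 + inactive."""
    return active * 256 + inactive


def pick_least_connection(table):
    """table: list of (name, active, inactive) tuples in IPVS table order.

    min() returns the first of several equal minima, so a tie goes to
    the node that appears first in the table.
    """
    return min(table, key=lambda node: lc_overhead(node[1], node[2]))[0]


# Hypothetical connection counts.
table = [
    ("node1", 2, 10),    # 2 * 256 + 10  = 522
    ("node2", 1, 200),   # 1 * 256 + 200 = 456
    ("node3", 1, 200),   # same overhead as node2, but later in the table
]
print(pick_least_connection(table))
```

The multiplier of 256 means that a single active connection outweighs up to 255 inactive ones, so inactive connections mostly act as a tiebreaker.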

Weighted Least-Connection (WLC)

The weighted least-connection scheduling method combines the least-connection method and a specified weight or ranking for each server to select the cluster node. (This is the default selection method if you do not specify one.) This method was intended for use in clusters with nodes that have differing processing capabilities.

The Director determines which cluster node to assign to a new inbound request for a cluster service by first calculating the overhead value (as described earlier in the discussion of the LC scheduling method) for each cluster node and then dividing this value by the weight you have assigned to the cluster node to arrive at a WLC value for each cluster node. The cluster node with the lowest WLC value wins, and the incoming request is assigned to that node.[15]

If the WLC value for all of the nodes is the same, the first node found in the list of cluster nodes is selected. (We'll talk more about this list, which is called the IPVS table, in the next three chapters.)
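A minimal sketch of the WLC selection follows. To stay with whole numbers, as footnote 15 says the kernel code does, it compares two nodes by cross-multiplying instead of dividing: a/wa < b/wb is the same test as a * wb < b * wa when both weights are positive. The function name, the tuple layout, and the sample figures are assumptions, and nodes with a weight of 0 are simply skipped.

```python
def wlc_pick(table):
    """table: list of (name, active, inactive, weight) in IPVS table order.

    Returns the name of the node with the lowest overhead/weight value,
    or None if no node has a positive weight.
    """
    best = None  # (name, overhead, weight) of the best node seen so far
    for name, active, inactive, weight in table:
        if weight <= 0:
            continue  # weight 0: no new connections are assigned
        overhead = active * 256 + inactive
        # overhead/weight < best_overhead/best_weight, without dividing
        if best is None or overhead * best[2] < best[1] * weight:
            best = (name, overhead, weight)
    return None if best is None else best[0]


# Hypothetical figures.
table = [
    ("node1", 10, 0, 1),   # 2560 / 1 = 2560
    ("node2", 25, 0, 3),   # 6400 / 3 = 2133.33...
    ("node3", 0, 0, 0),    # idle, but weight 0, so never chosen
]
print(wlc_pick(table))
```

The strict comparison keeps the earlier node when two values are equal, which matches the tie rule given above.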

The WLC scheduling method is a good choice for a Linux Enterprise Cluster because it does a good job of balancing the workload of a typical enterprise.

Shortest Expected Delay (SED)

SED is a recent addition to the LVS scheduling methods, and it may offer a slight improvement over the WLC method for services that use TCP and remain in an active state while the cluster node is processing each request (large batch jobs are a good example of this type of request).

The SED calculation is performed as follows: The overhead value for each cluster node is calculated by adding 1 to the number of active connections. The overhead value is then divided by the weight you assigned to each node to arrive at the SED value. The cluster node with the lowest SED value wins.

There are two things to notice about the SED scheduling method:

  • It does not use the number of inactive connections when determining the overhead of each cluster node.

  • It adds 1 to the number of active connections to anticipate what the overhead will look like after the new incoming connection has been assigned.

For example, let's say you have two cluster nodes and one is three times faster than the other (one has a 1 GHz processor and the other has a 3 GHz processor[16]), so you decide to assign the slower machine a weight of 1 and the faster machine a weight of 3. Suppose the cluster has been up and running for a while, and the slower node has 10 active connections and the faster node has 30 active connections. When the next new request arrives, the Director must decide which cluster node to assign. If this new request is not added to the number of active connections for each of the cluster nodes, the SED values would be calculated as follows:

Slower node (1 GHz processor)

10 active connections / weight 1 = 10

Faster node (3 GHz processor)

30 active connections / weight 3 = 10

Because the SED values are the same, the Director will pick whichever node happens to appear first in its table of cluster nodes. If the slower cluster node happens to appear first in the table of cluster nodes, it will be assigned the new request even though it is the slower node.

If the new connection is first added to the number of active connections, however, the calculations look like this:

Slower node (1 GHz processor)

11 active connections / weight 1 = 11

Faster node (3 GHz processor)

31 active connections / weight 3 = 10.33

The faster node now has the lower SED value, so it is properly assigned the new connection request.

A side effect of adding 1 to the number of active connections is that a cluster node may sit idle even though multiple requests are assigned to another cluster node. For example, let's use our same two cluster nodes, but this time we'll assume the slower cluster node has no active connections and the faster node has one active connection. The SED calculation for each node looks like this (recall that 1 is added to the number of active connections):

Slower node (1 GHz processor)

1 active connection / weight 1 = 1

Faster node (3 GHz processor)

2 active connections / weight 3 = .67

So the new request gets assigned to the faster cluster node even though the slower cluster node is idle. This may or may not be desirable behavior, so another scheduling method was developed, called never queue.
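The two worked examples above can be reproduced with a short sketch. It follows the description in the text, (active connections + 1) divided by weight, and uses exact fractions to avoid rounding surprises; the function names are hypothetical and weights are assumed to be positive.

```python
from fractions import Fraction


def sed_value(active, weight):
    """SED value as described in the text: (active + 1) / weight."""
    return Fraction(active + 1, weight)


def sed_pick(table):
    """table: list of (name, active, weight); the lowest SED value wins."""
    return min(table, key=lambda node: sed_value(node[1], node[2]))[0]


busy = [("slower", 10, 1), ("faster", 30, 3)]       # first example in the text
mostly_idle = [("slower", 0, 1), ("faster", 1, 3)]  # second example

for table in (busy, mostly_idle):
    for name, active, weight in table:
        print(name, f"{float(sed_value(active, weight)):.2f}")
    print("assigned to:", sed_pick(table))
```

In both cases the faster node is chosen, including the second case, where the slower node has no active connections at all.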

Never Queue (NQ)

This scheduling method enhances the SED scheduling method by adding one new feature: if a cluster node has no active connections, it is always assigned the new incoming request for service, regardless of the result of the calculated SED values for each cluster node.
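As a sketch of the difference, the following adds the never-queue rule in front of the SED calculation. It follows the description in the text only; the function name is hypothetical, weights are assumed to be positive, and choosing the first idle node in table order is an assumption.

```python
from fractions import Fraction


def nq_pick(table):
    """table: list of (name, active, weight) in table order."""
    # Never-queue rule: an idle node always gets the new request.
    for name, active, _weight in table:
        if active == 0:
            return name
    # Otherwise fall back to SED: lowest (active + 1) / weight wins.
    return min(table, key=lambda node: Fraction(node[1] + 1, node[2]))[0]


# The two-node examples from the SED discussion.
print(nq_pick([("slower", 0, 1), ("faster", 1, 3)]))
print(nq_pick([("slower", 10, 1), ("faster", 30, 3)]))
```

With the idle slower node, NQ assigns the request to it, where SED alone would have chosen the faster node; when no node is idle the two methods agree.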

Locality-Based Least-Connection (LBLC)

Directors can also direct outbound traffic to a set of transparent proxy servers. In this configuration, the cluster nodes are transparent proxy or web cache servers that sit between client computers and the Internet.[17]

When the LBLC scheduling method is used, the Director attempts to send all requests destined for a particular IP address (a particular web server) to the same transparent proxy server (cluster node). In other words, the first time a request comes in for a web server on the Internet, the Director will pick one proxy server to service this destination IP address using a slightly modified version of the WLC scheduling method,[18] and all future requests for this same destination IP address will continue to go to the same proxy server. This method of load balancing, like the destination-hashing scheduling method described previously, is, therefore, a type of destination IP load balancing.

The Director will continue to send all requests for a particular destination IP address to the same cluster node (the same transparent proxy server) until it sees that another node in the cluster has a WLC value that is half of the WLC value of the assigned cluster node. When this happens, the Director will reassign the cluster node that is responsible for the destination IP address (usually an Internet web server) by selecting the least loaded cluster node using the modified WLC scheduling method.

In this method, the Director tries to associate only one proxy server to one destination IP address. To do this, the Director maintains a table of destination IP addresses and their associated proxy servers. This method of load balancing attempts to maximize the number of cache hits on the proxy servers, while at the same time reducing the amount of redundant, or replicated, information on these proxy servers.
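The destination table and the reassignment rule can be sketched as follows. This follows the book's description (a WLC-style value that uses a multiplier of 50, and reassignment once another node's value is no more than half that of the assigned node); it is not the kernel's LBLC code. The function names, proxy names, and destination address are hypothetical, and weights are assumed to be positive.

```python
from fractions import Fraction


def load_value(stats):
    """WLC-style value: (active * 50 + inactive) / weight, per footnote 18."""
    active, inactive, weight = stats
    return Fraction(active * 50 + inactive, weight)


def lblc_pick(dest_ip, nodes, assignments):
    """Return the proxy for dest_ip, updating the destination table in place.

    nodes: {name: (active, inactive, weight)}
    assignments: {destination IP: proxy name}, the Director's table
    """
    best = min(nodes, key=lambda name: load_value(nodes[name]))
    current = assignments.get(dest_ip)
    if current not in nodes:
        assignments[dest_ip] = best          # first request for this address
    else:
        current_value = load_value(nodes[current])
        if current_value > 0 and load_value(nodes[best]) * 2 <= current_value:
            assignments[dest_ip] = best      # another node is at half the load
    return assignments[dest_ip]


nodes = {"proxy1": (0, 0, 1), "proxy2": (0, 0, 1)}
table = {}
first = lblc_pick("203.0.113.80", nodes, table)

nodes["proxy1"], nodes["proxy2"] = (4, 0, 1), (3, 0, 1)   # 200 versus 150
second = lblc_pick("203.0.113.80", nodes, table)

nodes["proxy2"] = (2, 0, 1)                               # 200 versus 100
third = lblc_pick("203.0.113.80", nodes, table)
print(first, second, third)
```

The destination stays with the first proxy while the loads are merely uneven, and moves only when the other proxy's value drops to half.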

Locality-Based Least-Connection with Replication Scheduling (LBLCR)

The LBLCR scheduling method (which is also a form of destination IP load balancing) attempts to improve on the LBLC scheduling method by maintaining a set of proxy servers that can service each destination IP address. When a new connection request comes in, the Director will select the proxy server with the fewest number of active connections from this set of servers.


See the file /proc/net/ip_vs_lblcr for the servers associated with each destination IP address on a Director that is using the LBLCR scheduling method.

At the time of a new connection request from a client computer, proxy servers are added to this set for a particular destination IP address when the Director notices a cluster node (proxy server) that has a WLC[19] value equal to half of the WLC value of the least loaded node in the set. When this happens, the cluster node with the lowest WLC value is added to the set and is assigned the new incoming request.

Proxy servers are removed from the set when the list of servers in the set has not been modified within the last six minutes (meaning that no new proxy servers have been added, and no proxy servers have been removed). If this happens, the Director will remove the server with the most active connections from the set. The proxy server (cluster node) will continue to service the existing active connections, but any new requests will be assigned to a different server still in the set.
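The way a server set grows and shrinks can be sketched like this. It is a reading of the description above, not the kernel's LBLCR code. The names, the dictionary layout, and the explicit `now` argument (seconds) are hypothetical, and two details are assumptions: a set is never shrunk below one member, and "equal to half" is treated as "no more than half".

```python
from fractions import Fraction

EXPIRY_SECONDS = 6 * 60   # "not modified within the last six minutes"


def load_value(stats):
    """WLC-style value with the multiplier of 50 noted in footnote 19."""
    active, inactive, weight = stats
    return Fraction(active * 50 + inactive, weight)


def lblcr_pick(dest_ip, nodes, sets, now):
    """nodes: {name: (active, inactive, weight)}; sets: per-destination state."""
    entry = sets.setdefault(dest_ip, {"members": [], "modified": now})
    members = entry["members"]

    # Shrink: drop the busiest member if the set has been stable for six minutes.
    if len(members) > 1 and now - entry["modified"] >= EXPIRY_SECONDS:
        members.remove(max(members, key=lambda name: nodes[name][0]))
        entry["modified"] = now

    best = min(nodes, key=lambda name: load_value(nodes[name]))
    if not members:
        members.append(best)
        entry["modified"] = now
    elif best not in members:
        set_best = load_value(nodes[min(members, key=lambda n: load_value(nodes[n]))])
        # Grow: an outside node carries at most half the load of the set's best.
        if set_best > 0 and load_value(nodes[best]) * 2 <= set_best:
            members.append(best)
            entry["modified"] = now
            return best

    # Normal case: the member with the fewest active connections.
    return min(members, key=lambda name: nodes[name][0])


nodes = {"proxy1": (4, 0, 1), "proxy2": (1, 0, 1)}
sets = {}
first = lblcr_pick("203.0.113.80", nodes, sets, now=0)

nodes["proxy1"], nodes["proxy2"] = (2, 0, 1), (6, 0, 1)
second = lblcr_pick("203.0.113.80", nodes, sets, now=10)
third = lblcr_pick("203.0.113.80", nodes, sets, now=10 + EXPIRY_SECONDS)
print(first, second, third, sets["203.0.113.80"]["members"])
```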


None of these scheduling methods take into account the processing load or disk or network I/O a cluster node is experiencing, nor do they anticipate how much load a new inbound request will generate when assigning the incoming request. You may see references to two projects that were designed to address this need, called feedbackd and LVS-KISS. As of this writing, however, these projects are not widely deployed, so you should carefully consider them before using them in production.

[12]Of course, it is possible that no data may pass between the cluster node and client computer during the telnet session for long periods of time (when a user runs a batch job, for example). See the discussion of the TCP session timeout value in the "LVS Persistence" section in Chapter 14.

[13]See the 143 lines of source code in the ip_vs_lc.c file in the LVS source code distribution (version 1.0.10).

[14]The first cluster node that is capable of responding for the service (or port number) the client computer is requesting. We'll discuss the IP Virtual Server or IPVS table in more detail in the next three chapters.

[15]The code actually uses a mathematical trick to multiply instead of divide to find the node with the best overhead-to-weight value, because floating-point numbers are not allowed inside the kernel. See the ip_vs_wlc.c file included with the LVS source code for details.

[16]For the moment, we'll ignore everything else about the performance capabilities of these two nodes.

[17]A proxy server stores a copy of the web pages, or server responses, that have been requested recently so that future requests for the same information can be serviced without the need to ask the original Internet server for the same information again. See the "Transparent Proxy with Linux and Squid mini-HOWTO," which is available online.

[18]The modification is that the overhead value is calculated by multiplying the active connections by 50 instead of by 256.

[19]Again the overhead value used to calculate the WLC value is calculated by multiplying the number of active connections by 50 instead of 256.

In Conclusion

This chapter introduced the Linux Virtual Server (LVS) cluster load balancer called the LVS Director and the forwarding and scheduling methods it uses to pass incoming requests for cluster services to the nodes inside the cluster.

The forwarding methods used by the Director are called Network Address Translation, direct routing, and tunneling, or LVS-NAT, LVS-DR, and LVS-TUN respectively. Although the Director can select a different forwarding method for each node inside the cluster, most load balancers use only one type of forwarding method for all of the nodes in the cluster to keep the setup simpler. As I've discussed briefly in this chapter, an LVS-DR cluster is the best type of cluster to use when building an enterprise-class cluster.

The LVS scheduling methods were also introduced in this chapter. The Director uses a scheduling method to evenly distribute the workload across the cluster nodes, and these methods fall into two categories: fixed and dynamic. Fixed scheduling methods differ from dynamic methods in that no information about the current number of active connections is used when selecting a cluster node using a fixed method.

The next two chapters take a closer look at the LVS-NAT and LVS-DR forwarding methods.

Chapter 12: The LVS-NAT Cluster

Recall from the last chapter that the computers outside the cluster that access cluster services are called client computers, the cluster load balancer is called the Director, and nodes inside the cluster are called real servers.

In this chapter, I'll look at how client computers access services on real servers in a Linux Virtual Server Network Address Translation (LVS-NAT) cluster, and at how the Director uses the LVS-NAT forwarding method. I'll then give you a recipe for building an LVS-NAT web cluster.

How Client Computers Access LVS-NAT Cluster Resources

To understand how client computers access cluster services, we'll use the example of a client that is connected to the Internet and is calling up a web page offered by the LVS-NAT cluster. (We'll look at how to build this cluster in the "Building an LVS-NAT Web Cluster" section, later in this chapter.) Figure 12-1 shows the beginning of this network conversation.[1]

Figure 12-1: In packet 1 the client computer sends a request to the LVS-NAT cluster

As you can see in the figure, the client computer initiates the network conversation with the cluster by sending the first packet from a client IP (CIP1), through the Internet, to the VIP1 address on the Director. The source address of this packet is CIP1, and the destination address is VIP1 (which the client knew, thanks to a naming service like DNS). The packet's data payload is the HTTP request from the client computer requesting the contents of a web page (an HTTP GET request).

When the packet arrives at the Director, the LVS code in the Director uses one of the scheduling methods that were introduced in Chapter 11 to decide which real server should be assigned to this request. Because our cluster network has only one cluster node, the Director doesn't have much choice; it has to use real server 1, which has the address RIP1.


The Director does not examine or modify the data payload of the packet.

In Chapter 14 we'll discuss what goes on inside the kernel as the packet passes through the Director, but for now we only need to know that the Director has made no significant change to the packet—the Director simply changed the destination address of the packet (and possibly the destination port number of the packet).


LVS-NAT is the only forwarding method that allows you to remap network port numbers as packets pass through the Director.

This packet, with a new destination address and possibly a new port number, is sent from the Director to the real server, as shown in Figure 12-2, and now it's called packet 2. Notice that the source address in packet 2 is the client IP (CIP) address taken from packet 1, and the data payload (the HTTP request) remains unchanged. What has changed is the destination address: the Director changed the destination of the original packet to one of the real server RIP addresses inside the cluster (RIP1 in this example). Figure 12-2 shows the packet inside the LVS-NAT cluster on the network cable that connects the DIP to the cluster network.

Figure 12-2: In packet 2 the Director forwards the client computer's request to a cluster node

When packet 2 arrives at real server 1, the HTTP server replies with the contents of the web page, as shown in Figure 12-3. Because the request was from CIP1, packet 3 (the return packet) will have a destination address of CIP1 and a source address of RIP1. In LVS-NAT, the default gateway for the real server is normally an IP address on the Director (the DIP), so the reply packet is routed through the Director.[2]

Figure 12-3: In packet 3 the cluster node sends a reply back through the Director

Packet 3 is sent through the cluster network to the Director, and its data payload is the web page that was requested by the client computer. When the Director receives packet 3, it passes the packet back through the kernel, and the LVS software rewrites the source address of the packet, changing it from RIP1 to VIP1. The data payload does not change as the packet passes through the Director—it still contains the HTTP reply from real server 1. The Director then forwards the packet (now called packet 4) back out to the Internet, as shown in Figure 12-4.

Figure 12-4: In packet 4 the Director forwards the reply packet to the client computer

Packet 4, shown in Figure 12-4, is the reply packet from the LVS-NAT cluster, and it contains a source address of VIP1 and a destination address of CIP1. The conversion of the source address from RIP1 to VIP1 is what gives LVS-NAT its name: the translation of the network address lets the Director hide all of the cluster nodes' real IP addresses behind a single virtual IP address.
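The four packets of this conversation can be traced with a small sketch. It models only the address rewriting described above (port remapping and the TCP handshake are left out), and the `Packet` type, the function names, and the concrete addresses standing in for CIP1, VIP1, and RIP1 are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str        # source IP address
    dst: str        # destination IP address
    payload: bytes  # never examined or modified by the Director


# Hypothetical addresses standing in for the names used in the figures.
CIP1, VIP1, RIP1 = "198.51.100.7", "203.0.113.10", "10.1.1.2"


def director_inbound(packet: Packet, rip: str) -> Packet:
    """Packet 1 -> packet 2: rewrite the destination from the VIP to a RIP."""
    return replace(packet, dst=rip)


def director_outbound(packet: Packet, vip: str) -> Packet:
    """Packet 3 -> packet 4: rewrite the source from the RIP to the VIP."""
    return replace(packet, src=vip)


packet1 = Packet(src=CIP1, dst=VIP1, payload=b"GET / HTTP/1.0\r\n\r\n")
packet2 = director_inbound(packet1, RIP1)
packet3 = Packet(src=RIP1, dst=CIP1, payload=b"HTTP/1.0 200 OK\r\n\r\n")
packet4 = director_outbound(packet3, VIP1)

for number, packet in enumerate((packet1, packet2, packet3, packet4), start=1):
    print(f"packet {number}: {packet.src} -> {packet.dst}")
```

From the client's point of view only packets 1 and 4 exist, and both mention nothing but CIP1 and VIP1.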

There are several things to notice about the process of this conversation:

  • The Director is acting as a router for the cluster network.[3]

  • The cluster nodes (real servers) use the DIP as their default gateway.[4]

  • The Director must receive all of the inbound packets destined for the cluster.

  • The Director must reply on behalf of the cluster nodes.

  • Because the Director is masquerading the network addresses of the cluster nodes, the only way to test that the VIP addresses are properly replying to client requests is to use a client computer outside the cluster.


The best method for testing an LVS cluster, regardless of the forwarding method you use, is to test the cluster's functionality from outside the cluster.

Virtual IP Addresses on LVS-NAT Real Servers

If you need to offer client computers access to different services on separate VIP addresses, cluster nodes in an LVS-NAT cluster can also use multiple RIP addresses (IP aliases or secondary IP addresses) to identify the VIP address the client computer used in the original request. As you can see in Figure 12-5, the Director can be configured to send packets to the cluster nodes (real servers) using multiple virtual RIP addresses. In this scenario, client computers will still only know the VIPs configured on the Director; the Director is responsible for sending the packets into the cluster through the proper virtual RIP address (using a load-balancing or scheduling method).

Figure 12-5: An LVS-NAT cluster with multiple VIPs

You might choose this scenario to load balance different types of users on different IP addresses. For example, accounting and reporting users might use one IP address, and customer service and simple transaction-based users would use another. You may also want to use IP-based virtual hosts for your web pages (see Appendix F for a detailed discussion of Apache virtual hosting).

Figure 12-5 shows the Director using three VIPs to receive network requests from client computers. The Director uses the LVS-NAT forwarding method to select one of the virtual RIP addresses on the real servers inside the cluster (any other LVS forwarding method could be used, too). In this configuration, the VIP addresses would normally be associated with the virtual RIP addresses, as shown in Figure 12-6.

Figure 12-6: Multiple VIPs and their relationship with the multiple virtual RIPs

Like the VIPs on the Director, the virtual RIP addresses can be created on the real servers using IP aliasing or secondary IP addresses. (IP aliasing and secondary IP addresses were introduced in Chapter 6.) A packet received on the Director's VIP1 will be forwarded by the Director to either virtual RIP1-1 on real server 1 or virtual RIP2-1 on real server 2. The reply packet (from the real server) will be sent back through the Director and masqueraded so that the client computer will see the packet as coming from VIP1.
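The association between VIPs and virtual RIPs can be pictured as a lookup table. In the sketch below, only RIP1-1 and RIP2-1 come from the text; the other names extend the same naming pattern as an assumption, and plain round-robin stands in for whichever scheduling method the Director is configured to use.

```python
from itertools import cycle

# VIP -> virtual RIPs, one per real server. Names other than RIP1-1 and
# RIP2-1 are assumed by analogy with the text.
VIRTUAL_RIPS = {
    "VIP1": ["RIP1-1", "RIP2-1"],
    "VIP2": ["RIP1-2", "RIP2-2"],
    "VIP3": ["RIP1-3", "RIP2-3"],
}

# One independent rotation per VIP.
_rotations = {vip: cycle(rips) for vip, rips in VIRTUAL_RIPS.items()}


def forward(vip: str) -> str:
    """Return the virtual RIP that the next request for this VIP is sent to."""
    return next(_rotations[vip])


picks = [forward("VIP1"), forward("VIP1"), forward("VIP2"), forward("VIP1")]
print(picks)
```

Whichever real server is chosen, the virtual RIP that receives the packet tells that server which VIP the client originally used.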


When building a cluster of web servers, you can use name-based virtual hosts instead of assigning multiple RIPs to each real server. For a description of name-based virtual hosts, see Appendix F.

[1]For the moment, we will ignore the lower-level TCP connection establishment packets (called SYN and ACK packets).

[2]The packets must pass back through the Director in an LVS-NAT cluster.

[3]The Director can also act as a firewall for the cluster network, but this is outside the scope of this book. See the LVS HOWTO for a discussion of the Antefacto patch.

[4]If the cluster nodes do not use the DIP as their default gateway, a routing table entry on the real server is required to send the client reply packets back out through the Director.

Building an LVS-NAT Web Cluster

This recipe describes how to build an LVS-NAT web cluster consisting of a Director and a real server, using the Apache web server, as shown in Figure 12-7.

Figure 12-7: LVS-NAT web cluster

The LVS-NAT web cluster we'll build can be connected to the Internet, as shown in Figure 12-7. Client computers connected to the internal network (connected to the network switch shown in Figure 12-7) can also access the LVS-NAT cluster. Recall from our previous discussion, however, that the client computers must be outside the cluster (and must access the cluster services on the VIP).


Figure 12-7 also shows a mini hub connecting the Director and the real server, but a network switch and a separate VLAN would work just as well.

Before we build our first LVS-NAT cluster, we need to decide on an IP address range to use for our cluster network. Because these IP addresses do not need to be known by any computers outside the cluster network, the numbers you use are unimportant, but they should conform to RFC 1918.[5] We'll use this IP addressing scheme:

LVS-NAT cluster network (10.1.1.)

LVS-NAT cluster broadcast address

LVS-NAT cluster subnet mask

Assign the VIP address by picking a free IP address on your network. We'll use a fictitious VIP address of

Let's continue with our recipe metaphor.

Recipe for LVS-NAT

List of ingredients:

  • 2 servers running Linux[6]

  • 1 client computer running an OS capable of supporting a web browser

  • 3 network interface cards (2 for the Director and 1 for the real server)

  • 1 mini hub with 2 twisted-pair cables (or 1 crossover cable)[7]

Step 1: Install the Operating System

When you install Linux on the two servers, be sure to configure the systems as web servers without any iptables (firewall or security) rules. The normal Red Hat installation process, for example, will automatically load Apache and create the /etc/httpd directory containing the Apache configuration files when you tell it that you would like your server to be a web server. Also, you do not need to load any X applications or a display manager for this recipe.

Step 2: Configure and Start Apache on the Real Server

In this step you have to select which of the two servers will become the real server in your cluster. On the machine you select, make sure that the Apache daemon starts each time the system boots (see Chapter 1 for a description of the system boot process). You may have to use the chkconfig command to cause the httpd boot script to run at your default system runlevel. (See Chapter 1 for complete instructions.)

Next, you should modify the Apache configuration file on this system so it knows how to display the web content you will use to test your LVS-NAT cluster. (See Appendix F for a detailed discussion.)

After you've saved your changes to the httpd.conf file, make sure the error log and access log files you specified are available by creating empty files with these commands:

 #touch /var/log/httpd/error_log
 #touch /var/log/httpd/access_log

The touch command creates an empty file.

Also make sure your DocumentRoot exists and contains an index.html file—this is the default file the web server will display when you call up the URL. Use these commands:

 #mkdir -p /www/htdocs
 #echo "This is a test (from the real server)" > /www/htdocs/index.html

Start Apache on the Real Server

You are now ready to start Apache using one of these commands:

 #/etc/init.d/httpd start

or

 #service httpd start

If the HTTPd daemon was already running, you will first need to enter one of these commands to stop it:

 #/etc/init.d/httpd stop

or

 #service httpd stop

The script should display this response: OK. If it doesn't, you probably have made an error in your configuration file (see Appendix F).

If it worked, confirm that Apache is running with this command:

 #ps -elf | grep httpd

You should see several lines of output indicating that several HTTPd daemons are running. If not, check for errors with these commands:

 #tail /var/log/messages
 #cat /var/log/httpd/*
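One small trap with a pipeline like ps -elf | grep httpd is that the grep process itself usually appears in the listing, so a single line of output does not prove that Apache is running. The sketch below counts matching processes while ignoring the grep line. The function name count_daemon is an assumption made for illustration; it reads the process listing on standard input so that it can be tried on saved output as well.

```shell
#!/bin/sh
# count_daemon NAME
# Reads process-listing text (such as the output of "ps -elf") on standard
# input and prints how many lines mention NAME, ignoring any line that
# contains the word "grep" (the search command itself).
# Hypothetical helper, not part of Apache or of the original recipe.
count_daemon() {
    grep -- "$1" | grep -vc -- "grep" || true
}

# Typical use on the real server:
#   ps -elf | count_daemon httpd
```

A result of 0 means no daemon was found, in which case the log files listed above are the next place to look.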

If the HTTPd daemons are running, see if you can display your web page locally by using the Lynx command-line web browser:[8]

 #lynx -dump

If everything is set up correctly, you should see the contents of your index.html file dumped to the screen by the Lynx web browser:

 This is a test (from the real server)

Step 3: Set the Default Route on the Real Server

Real servers in an LVS-NAT cluster need to send all of their replies to client computers back through the Director. To accomplish this, we must set the default route for the real servers to the DIP. We can do so on Red Hat Linux by setting the GATEWAY variable in the /etc/sysconfig/network file.

Open the file in the vi text editor with this command:

 #vi /etc/sysconfig/network

Add, or set, the GATEWAY variable to the Director's IP (DIP) address with an entry like this:


We can enable this default route by rebooting (or by re-running the /etc/init.d/network script), or we can enter it into the routing table manually with the following command:

 #/sbin/route add default gw

This command might complain with the following error:

 SIOCADDRT: File exists

If it does, this means the default route entry in the routing table has already been defined. You should reboot your system or remove the conflicting default gateway with the route del command.
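To find out whether a conflicting default route is already present before adding a new one, you can inspect the output of route -n. The sketch below is one way to do that, assuming the usual net-tools column layout (Destination, Gateway, Genmask, Flags, and so on). The function names and the sample address in the usage comment are hypothetical.

```shell
#!/bin/sh
# default_gateway
# Reads "route -n" style output on standard input and prints the gateway of
# the first default route (destination 0.0.0.0 with the G flag set).
default_gateway() {
    awk '$1 == "0.0.0.0" && $4 ~ /G/ { print $2; exit }'
}

# check_default_route DIP
# Reads "route -n" style output on standard input and prints:
#   ok              the default gateway already is DIP
#   missing         no default route is defined
#   conflict <gw>   a different default gateway is defined
check_default_route() {
    gw=$(default_gateway)
    if [ -z "$gw" ]; then
        echo "missing"
    elif [ "$gw" = "$1" ]; then
        echo "ok"
    else
        echo "conflict $gw"
    fi
}

# Typical use on the real server (10.1.1.1 stands in for your DIP):
#   /sbin/route -n | check_default_route 10.1.1.1
```

A result of conflict means the route add command would fail with the error shown above until the old default route is removed.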

Step 4: Install the LVS Software on the Director

In this step you will make changes to the second server—the Director. (For the moment we are done making changes on the server you selected to be the real server.) The changes described here do not require you to reload your Linux distribution, but you will need to install a new kernel.

The Linux kernel is included on the CD-ROM that accompanies this book. This kernel contains the LVS software, but you'll need to compile and install this kernel with the LVS options enabled. You'll also need to install the ipvsadm utility to configure the Director. Older versions of the stock Linux kernel do not contain the LVS code, so you'll have to download the LVS patch and apply it to the kernel if you must use a kernel older than 2.4.23.

Copy the kernel source files included on the CD-ROM with this book (or download a kernel from and run the make menuconfig utility to enable the proper LVS and network options. Then use the instructions in Chapter 3 to compile and install this kernel on your system.


You can avoid compiling the kernel by downloading a kernel that already contains the LVS patches from a distribution vendor. The Ultra Monkey project at also has patched versions of the Red Hat Linux kernel that can be downloaded as RPM files.

Once you have rebooted your system on the new kernel,[9] you are ready to install the ipvsadm utility, which is also included on the CD-ROM. With the CD mounted, type these commands:

 #cp /mnt/cdrom/chapter12/ipvsadm* /tmp
 #cd /tmp
 #tar xvf ipvsadm*
 #cd ipvsadm-*
 #make install

If this command completes without errors, you should now have the ipvsadm utility installed in the /sbin directory. If it does not work, make sure the /usr/src/linux directory (or symbolic link) contains the source code for the kernel with the LVS patches applied to it.

Step 5: Configure LVS on the Director

Now we need to tell the Director how to forward packets to the cluster nodes (the real servers) using the ipvsadm utility we compiled and installed in the previous step.

One way to do this is to use the configure script included with the LVS distribution. (See the LVS HOWTO at for a description of how to use this method to configure an LVS cluster.) In this chapter, however, we will use our own custom script to create our LVS cluster so that we can learn more about the ipvsadm utility.


We will abandon this method in Chapter 15 when we use the ldirectord daemon to create an LVS Director configuration—ldirectord will enter the ipvsadm commands to create the IPVS table automatically.

Create an /etc/init.d/lvs script that looks like this (this script is included in the chapter12 subdirectory on the CD-ROM that accompanies this book):

 #!/bin/sh
 #
 # LVS script
 #
 # chkconfig: 2345 99 90
 # description: LVS sample script
 #
 # The addresses are missing from this copy of the script. Set these
 # variables to the VIP, its netmask, and the real server's RIP.
 VIP=
 VIP_NETMASK=
 RIP=

 case "$1" in
 start)
         # Bring up the VIP (normally this should be under Heartbeat's
         # control).
         /sbin/ifconfig eth0:1 "$VIP" netmask "$VIP_NETMASK" up
         # Since this is the Director we must be
         # able to forward packets.[10]
         echo 1 > /proc/sys/net/ipv4/ip_forward
         # Clear all iptables rules.
         /sbin/iptables -F
         # Reset iptables counters.
         /sbin/iptables -Z
         # Clear all ipvsadm rules/services.
         /sbin/ipvsadm -C
         # Add an IP virtual service for VIP port 80
         /sbin/ipvsadm -A -t "$VIP:80" -s rr
         # Now direct packets for this VIP to
         # the real server IP (RIP) inside the cluster
         /sbin/ipvsadm -a -t "$VIP:80" -r "$RIP" -m
         ;;
 stop)
         # Stop forwarding packets
         echo 0 > /proc/sys/net/ipv4/ip_forward
         # Reset ipvsadm
         /sbin/ipvsadm -C
         # Bring down the VIP interface
         /sbin/ifconfig eth0:1 down
         ;;
 *)
         echo "Usage: $0 {start|stop}"
         ;;
 esac

If you are running on a version of the kernel prior to version 2.4, you will also need to configure the masquerade for the reply packets that pass back through the Director. Do this with the ipchains utility by using a command such as the following:

 /sbin/ipchains -A forward -j MASQ -s -d

Starting with kernel 2.4, however, you do not need to enter this command because LVS does not use the kernel's NAT code. The 2.0 series kernels also needed to use the ipfwadm utility, which you may see mentioned on the LVS website, but this is no longer required to build an LVS cluster.

The two most important lines in the preceding script are the lines that create the IP virtual server:

 /sbin/ipvsadm -A -t -s rr
 /sbin/ipvsadm -a -t -r -m

The first line specifies the VIP address and the scheduling method (-s rr). The choices for scheduling methods (which were described in the previous chapter) are as follows:

ipvsadm Argument    Scheduling Method
-s rr               Round-robin
-s wrr              Weighted round-robin
-s lc               Least-connection
-s wlc              Weighted least-connection
-s lblc             Locality-based least-connection
-s lblcr            Locality-based least-connection with replication
-s dh               Destination hashing
-s sh               Source hashing
-s sed              Shortest expected delay
-s nq               Never queue

In this recipe, we will use the round-robin scheduling method. In production, however, you should use a weighted, dynamic scheduling method (see Chapter 11 for explanations of the various methods).

The second ipvsadm line listed above uses the -r option to associate the real server's RIP with the VIP (the virtual server), and it specifies the forwarding method (-m). Each ipvsadm entry for the same virtual server can use a different forwarding method, but normally only one method is used. The choices are as follows:

ipvsadm Argument    Forwarding Method
-g                  LVS-DR (direct routing)
-i                  LVS-TUN (IP tunneling)
-m                  LVS-NAT (masquerading)

In this recipe, we are building an LVS-NAT cluster, so we will use the -m option.[11]
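To see how the two kinds of ipvsadm lines fit together when there is more than one real server, the following sketch prints (but does not run) the commands for one virtual service. The function name lvs_nat_rules and all of the addresses are assumptions made for illustration; only the ipvsadm options (-A, -a, -t, -s, -r, -m) come from the script above.

```shell
#!/bin/sh
# lvs_nat_rules VIP PORT SCHEDULER RIP...
# Prints the ipvsadm commands that would create one TCP virtual service on
# VIP:PORT and attach each RIP to it with the masquerading (-m) method.
# Nothing is executed, so this is safe to run on any machine.
lvs_nat_rules() {
    vip=$1 port=$2 sched=$3
    shift 3
    echo "/sbin/ipvsadm -A -t $vip:$port -s $sched"
    for rip in "$@"; do
        echo "/sbin/ipvsadm -a -t $vip:$port -r $rip:$port -m"
    done
}

# Hypothetical addresses: 192.0.2.10 as the VIP, two RIPs on 10.1.1.x.
lvs_nat_rules 192.0.2.10 80 rr 10.1.1.2 10.1.1.3
```

Printing the commands first makes it easy to review them before pasting them into the start) branch of an init script.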

When you have finished entering this script or modifying the copy on the CD-ROM, run it by typing this command:

 #/etc/init.d/lvs start

If the VIP address has been added to the Director (which you can check by using the ifconfig command), log on to the real server and try to ping this VIP address. You should also be able to ping the VIP address from a client computer (so long as there are no intervening firewalls that block ICMP packets).


ICMP ping packets sent to the VIP will be handled by the Director (these packets are not forwarded to a real server). However, the Director will try to forward ICMP packets relating to a connection to the relevant real server.

To see the IP Virtual Server table (the table we have just created that tells the Director how to forward incoming requests for services to real servers inside the cluster), enter the following command on the server you have made into a Director:

 #/sbin/ipvsadm -L -n

This command should show you the IPVS table:

  IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                    Masq    1      0          0

This report wraps both the column headers and each line of output onto the next line. We have only one IP virtual server pointing to one real server, so our one (wrapped) line of output is:

 TCP rr
   ->                    Masq    1      0          0

This line says that our IP virtual server is using the TCP protocol for VIP address on port 80. Packets are forwarded (->) to RIP address on port 80, and our forwarding method is masquerading (Masq), which is another name for Network Address Translation, or LVS-NAT. The LVS forwarding method reported in this field will be one of the following:

Report Output    LVS Forwarding Method
Masq             LVS-NAT
Route            LVS-DR
Tunnel           LVS-TUN

Step 6: Test the Cluster Configuration

The next step is to test the cluster configuration. In this step, <DR> will precede the commands that should be entered on the Director, and <RS> will precede the commands that should be entered on the real server.

The first thing to do is make sure the network interface cards are configured properly and are receiving packets. Enter the following command (on the Director):

 <DR>#ifconfig -a

The important thing to look for in the output of this command is whether or not the interface is UP, and how many packets have been transmitted (TX) and received (RX). In the following sample output I've marked this information in bold:

 eth0     Link encap:Ethernet  HWaddr 00:10:5A:16:99:8A
          inet addr:  Bcast:  Mask:
          RX packets:481 errors:0 dropped:0 overruns:0 frame:0
          TX packets:374 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:5 Base address:0x220
 eth0:1   Link encap:Ethernet  HWaddr 00:10:5A:16:99:8A
          inet addr:  Bcast:  Mask:
          Interrupt:5 Base address:0x220
 eth1     Link encap:Ethernet  HWaddr 00:80:5F:0E:AB:AB
          inet addr:  Bcast:  Mask:
          RX packets:210 errors:0 dropped:0 overruns:0 frame:0
          TX packets:208 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:11 Base address:0x1400

If you do not see the word UP displayed in the output of your ifconfig command for all of your network interface cards and network interface aliases (or IP aliases), as shown in the bold sections in this example, then the software driver for the missing network interface card is not configured properly (or the card is not installed properly). See Appendix C for more information on working with network interface cards.


If you use secondary IP addresses instead of IP aliases, use the ip addr sh command to check the status of the secondary IP addresses instead of the ifconfig -a command. You can then examine the output of this command for the same information (the report returned by the ip addr sh command is a slightly modified version of the sample output I've just given).

If the interface is UP but no packets have been transmitted (TX) or received (RX) on the card, you may have network cabling problems.
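Checking every interface block by eye gets tedious once there are several aliases. The sketch below lists the interfaces whose block has no UP flag. It assumes the older net-tools ifconfig layout used in the sample output above, in which each interface name starts in the first column and the flags appear on an indented line beginning with UP; newer ifconfig versions print a different format that this does not handle. The function name is hypothetical.

```shell
#!/bin/sh
# ifaces_not_up
# Reads "ifconfig -a" output (older net-tools layout) on standard input and
# prints the name of every interface whose block has no line starting
# with the UP flag.
ifaces_not_up() {
    awk '
        /^[^ \t]/ {                      # a new interface block starts in column 1
            if (name != "" && !up) print name
            name = $1; up = 0
        }
        /^[ \t]+UP([ \t]|$)/ { up = 1 }  # indented flags line that begins with UP
        END { if (name != "" && !up) print name }
    '
}

# Typical use on the Director:
#   ifconfig -a | ifaces_not_up
```

No output means every interface and alias reported the UP flag.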

Once you have a working network configuration, test the communication from the real server to the Director by pinging the Director's DIP and then VIP addresses from the real server. To continue the example, you would enter the following commands on the real server:


The first of these commands pings the Director's cluster IP address (DIP), and the second pings the Director's virtual IP address (VIP). Both of these commands should report that packets can successfully travel from the real server through the cluster network to the Director and that the Director successfully recognizes both its DIP and VIP addresses. Once you are sure this basic cluster network's IP communication is working properly, use a client computer to ping the VIP address from outside the cluster.

When you have confirmed that all of these tests work, you are ready to test the service from the real server. Use the Lynx program to send an HTTP request to the locally running Apache server on the real server:

 <RS>#lynx -dump

If this does not return the test web page message, check to make sure HTTPd is running by issuing the following commands (on the real server):

 <RS>#ps -elf | grep httpd
 <RS>#netstat -apn | grep httpd

The first command should return several lines of output from the process table, indicating that several HTTPd daemons are running. The second command should show that these daemons are listening on the proper network ports (or at least on port 80). If either of these commands does not produce any output, you need to check your Apache configuration on the real server before continuing.

If the HTTPd daemons are running, you are ready to access the real server's HTTPd server from the Director. You can do so with the following command (on the Director):

 <DR>#lynx -dump

This command, which specifies the real server's IP address (RIP), should display the test web page from the real server.

If all of these commands work properly, use the following command to watch your LVS connection table on the Director:

 <DR>#watch ipvsadm -Ln

At first, this report should indicate that no active connections have been made through the Director to the real server, as shown here:

 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                    Masq    1      0          0

Leave this command running on the Director (the watch command will automatically update the output every two seconds), and from a client computer use a web browser to display the URL that is the Director's virtual IP address (VIP):

If you see the test web page, you have successfully built your first LVS cluster.

If you now look quickly back at the console on the Director, you should see an active connection (ActiveConn) in the LVS connection tracking table:

 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                    Masq    1      1          0

In this report, the number 1 now appears in the ActiveConn column. If you wait a few seconds, and you don't make any further client connections to this VIP address, you should see the connection go from active status (ActiveConn) to inactive status (InActConn):

 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                    Masq    1      0          1

In this report, the number 1 now appears in the InActConn column. Finally, the connection drops out of the LVS connection tracking table altogether:

 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                    Masq    1      0          0

The ActiveConn and InActConn columns are now both 0.
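If you would rather log these counters than watch them, the real-server lines of the report can be picked out with awk. This is a minimal sketch that assumes the column layout shown in the reports above; the function name ipvs_servers is hypothetical.

```shell
#!/bin/sh
# ipvs_servers
# Reads "ipvsadm -L -n" output on standard input and prints one line per
# real server: address:port, forwarding method, ActiveConn, InActConn.
ipvs_servers() {
    # Real-server lines start with "->"; the header line does too, so skip it.
    awk '$1 == "->" && $2 != "RemoteAddress:Port" { print $2, $3, $5, $6 }'
}

# Typical use on the Director:
#   /sbin/ipvsadm -L -n | ipvs_servers
```

Each output line carries the same information as one wrapped entry in the report, which makes it easy to append to a log file while you test.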

[5]RFC 1918 reserves the following IP address blocks for private intranets:

  • 10.0.0.0 through 10.255.255.255

  • 172.16.0.0 through 172.31.255.255

  • 192.168.0.0 through 192.168.255.255

[6]Technically the real server (cluster node) can run any OS capable of displaying a web page.

[7]As mentioned previously, you can use a separate VLAN on your switch instead of a mini hub.

[8]For this to work, you must have selected the option to install the text-based web browser when you loaded the operating system. If you didn't, you'll need to go get the Lynx program and install it.

[9]As the system boots, watch for IPVS messages (if you compiled the IPVS schedulers into the kernel and not as modules) for each of the scheduling methods you compiled for your kernel.

[10]Many distributions also include the sysctl command for modifying the /proc pseudo filesystem. On Red Hat systems, for example, this kernel capability is controlled by sysctl from the /etc/rc.d/init.d/network script, as specified by a variable in the /etc/sysctl.conf file. This setting will override your custom LVS script if the network script runs after your custom LVS script. See Chapter 1 for more information about the boot order of init scripts.

[11]Note that you can't simply change this option to -g to build an LVS-DR cluster.

LocalNode: Using the Director as a Real Server

When you build an enterprise-class cluster, you will need to decide whether you want to dedicate two Linux servers to be the primary and backup LVS Directors. For the moment, however, you can test to make sure the load-balancing function is working properly (regardless of what your final design configuration will be) by configuring and installing Apache on the Director (as described in the "Step 2: Configure and Start Apache on the Real Server" section of this chapter) and then configuring your Director to send client computer requests to this locally running copy of Apache as if the Director were also a node inside the cluster. This configuration is called LocalNode or LocalNode mode.


A limitation of using a LocalNode configuration is that the Director cannot remap port numbers—whatever port number the client computer used in its request to access the service must also be used on the Director to service the request. The LVS HOWTO describes a workaround for this current limitation in LVS that uses iptables/ipchains redirection.

Assuming you have Apache up and running on the Director, you can modify the /etc/init.d/lvs script on the Director so two cluster nodes are available to client computers (the Director and the real server) as follows:

 # Add an IP virtual service for VIP port 80
 /sbin/ipvsadm -A -t -s rr
 # Now direct packets for this VIP to
 # the real server IP (RIP) inside the cluster
 /sbin/ipvsadm -a -t -r -m
 # And send requests to the locally running Apache
 # server on the Director.
 /sbin/ipvsadm -a -t -r -m

The last two uncommented lines cause the Director to send packets destined for the VIP address both to the RIP and to the daemons running locally on the Director.

Use ipvsadm to display the newly modified IP virtual server table:

 <DR>#ipvsadm -Ln
 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
 TCP rr
   ->                   Masq    1      0          0
   ->                    Masq    1      0          0

Now we have two servers capable of responding to requests received on the VIP: the cluster node and the Director.

Use the Lynx program to make sure you have created a default web page on your Director that is different from the default web page used on your real server. On the Director type this command:

 <DR>#lynx -dump

This should produce output that shows the default web page you have created on your Director.


For testing purposes you may want to create test pages that display the name of the real server that is displaying the page. See Appendix F for more info on how to configure Apache and how to use virtual hosts.
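One simple way to produce such pages is to write the machine's name into index.html on each node. The sketch below follows the mkdir and echo commands used earlier in this chapter; the function name make_test_page is an assumption, and the label defaults to the output of uname -n.

```shell
#!/bin/sh
# make_test_page DOCROOT [LABEL]
# Creates DOCROOT if necessary and writes an index.html that names the
# machine serving the page. LABEL defaults to this host's node name.
# Hypothetical helper, not part of Apache.
make_test_page() {
    docroot=$1
    label=${2:-$(uname -n)}
    mkdir -p "$docroot" || return 1
    echo "This is a test (from $label)" > "$docroot/index.html"
}

# Typical use on each node, with the DocumentRoot from this chapter:
#   make_test_page /www/htdocs
```

With a distinct label on each node, the page you see in the browser tells you immediately which node answered the request.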

Now you can watch the IP virtual server round-robin scheduling method in action. From a client computer (outside the cluster) call up the web page the cluster is offering by using the virtual IP address in the URL:

This URL specifies the Director's virtual IP address (VIP). Watch for this connection on the Director and make sure it moves from the ActiveConn to the InActConn column and then expires out of the IP virtual server table. Then click the Refresh button[12] on the client computer's web browser, and the web page should change to the default web page used on the real server.
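The alternation you see in the browser is what plain round-robin scheduling predicts: each new connection goes to the next server in the list, wrapping around at the end. The following sketch simulates that rotation in the shell. It is only a model of the idea, not the kernel's implementation, and the function name and server labels are hypothetical; the order in which a real Director starts the rotation may differ.

```shell
#!/bin/sh
# rr_schedule REQUESTS SERVER...
# Prints which server a plain round-robin rotation would pick for each of
# REQUESTS consecutive connections.
rr_schedule() {
    requests=$1
    shift
    n=$#
    [ "$n" -gt 0 ] || return 1
    i=0
    while [ "$i" -lt "$requests" ]; do
        # Select positional parameter number (i mod n) + 1.
        idx=$(( i % n + 1 ))
        eval "server=\${$idx}"
        echo "request $(( i + 1 )) -> $server"
        i=$(( i + 1 ))
    done
}

rr_schedule 4 real-server director
```

With two equally weighted nodes, consecutive requests simply alternate between them, which is why each Refresh shows the other node's page.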

[12]You may need to hold down the SHIFT key while clicking the Refresh button to avoid accessing the web page from the cached copy the web browser has stored on the hard drive. Also, be aware of the danger of accessing the same web page stored in a cache proxy server (such as Squid) if you have a proxy server configured in your web browser.

In Conclusion

In this chapter we looked at how client computers communicate with real servers inside an LVS-NAT cluster, and we built a simple LVS-NAT web cluster. As I mentioned in Chapter 10, the LVS-NAT cluster will probably be the first type of LVS cluster that you build because it is the easiest to get up and working. If you've followed along in this chapter you now have a working two-node LVS-NAT cluster. But don't stop there. Continue on to the next chapter to learn how to improve the LVS-NAT configuration and build an LVS-DR cluster.

Chapter 13: The LVS-DR Cluster


The Linux Virtual Server Direct Routing (LVS-DR) cluster is made possible by configuring all nodes in the cluster and the Director with the same VIP address; despite having this common address, though, client computers will only send their packets to the Director.

The Director can, therefore, balance the incoming workload from the client computers by using one of the LVS scheduling methods we looked at in Chapter 12.

The LVS-DR cluster configuration that we use in this chapter assumes that the Director is a computer dedicated to this task. In Chapter 14 we'll take a closer look at what is going on inside this computer, and in Chapter 15 we'll see how the load-balancing resource[1] can be placed on a real server and made highly available using the Heartbeat package.

But before we build an enterprise-class, highly available LVS-DR cluster (in Chapter 15), let's examine how the LVS-DR forwarding method works in more detail.

[1]Recall from our discussion of Heartbeat in Chapter 6 that a service, along with its associated IP address, is known as a resource. Thus, the virtual services offered by a Linux Virtual Server and their associated VIPs can also be called resources in high-availability terminology.

How Client Computers Access LVS-DR Cluster Services

Let's examine the TCP network communication that takes place between a client computer and the cluster. As with the LVS-NAT cluster network communication described in Chapter 12, the LVS-DR TCP communication starts when the client computer sends a request for a service running on the cluster, as shown in Figure 13-1.[2]

Figure 13-1: In packet 1 the client sends a request to the LVS-DR cluster

The first packet, shown in Figure 13-1, is sent from the client computer to the VIP address. Its data payload is an HTTP request for a web page.


An LVS-DR cluster, like an LVS-NAT cluster, can use multiple virtual IP (VIP) addresses, so we'll number them for the sake of clarity.

Before we focus on the packet exchange, there are a couple of things to note about Figure 13-1. The first is that the network interface card (NIC) the Director uses for network communication (the box labeled VIP1 connected to the Director in Figure 13-1) is connected to the same physical network that is used by the cluster node and the client computer. The VIP, the RIP, and the CIP, in other words, are all on the same physical network (the same network segment or VLAN).


You can have multiple NICs on the Director to connect the Director to multiple VLANs.

The second is that VIP1 is shown in two places in Figure 13-1: it is in a box representing the NIC that connects the Director to the network, and it is in a box that is inside of real server 1. The box inside of real server 1 represents an IP address that has been placed on the loopback device on real server 1. (Recall that a loopback device is a logical network device used by all networked computers to deliver packets locally.) Network packets[3] that are routed inside the kernel on real server 1 with a destination address of VIP1 will be sent to the loopback device on real server 1—in other words, any packets found inside the kernel on real server 1 with a destination address of VIP1 will be delivered to the daemons running locally on real server 1. (We'll see how packets that have a destination address of VIP1 end up inside the kernel on real server 1 shortly.)

Now let's look at the packet depicted in Figure 13-1. This packet was created by the client computer and sent to the Director. A technical detail not shown in the figure is the lower-level destination MAC address inside this packet. It is set to the MAC address of the Director's NIC that has VIP1 associated with it, and the client computer discovered this MAC address using the Address Resolution Protocol (ARP).

An ARP broadcast from the client computer asked, "Who owns VIP1?" and the Director replied to the broadcast using its MAC address and said that it was the owner. The client computer then constructed the first packet of the network conversation and inserted the proper destination MAC address to send the packet to the Director. (We'll examine a broadcast ARP request and see how such requests can create problems in an LVS-DR cluster environment later in this chapter.)


When the cluster is connected to the Internet, and the client computer is connected to the cluster over the Internet, the client computer will not send an ARP broadcast to locate the MAC address of the VIP. Instead, when the client computer wants to connect to the cluster, it sends packet 1 over the Internet, and when the packet arrives at the router that connects the cluster to the Internet, the router sends the ARP broadcast to find the correct MAC address to use.

When packet 1 arrives at the Director, the Director forwards the packet to the real server, leaving the source and destination addresses unchanged, as shown in Figure 13-2. Only the MAC address is changed from the Director's MAC address to the real server's (RIP) MAC address.

Figure 13-2: In packet 2 the Director forwards the client computer's request to a cluster node

Notice in Figure 13-2 that the source and destination IP address have not changed in packet 2: CIP1 is still the source address, and VIP1 is still the destination address. The Director, however, has changed the destination MAC address of the packet to that of the NIC on real server 1 in order to send the packet into the kernel on real server 1 (though the MAC addresses aren't shown in the figure). When the packet reaches real server 1, the packet is routed to the loopback device, because that's where the routing table inside the kernel on real server 1 is configured to send it. (In Figure 13-2, the box inside of real server 1 with the VIP1 address in it depicts the VIP1 address on the loopback device.) The packet is then received by a daemon running locally on real server 1 listening on VIP1, and that daemon knows what to do with the packet—the daemon is the Apache HTTPd web server in this case.

The HTTPd daemon then prepares a reply packet and sends it back out through the RIP1 interface with the source address set to VIP1, as shown in Figure 13-3.

Figure 13-3: In packet 3 the cluster node sends a reply directly back to the client computer

The packet shown in Figure 13-3 does not go back through the Director, because the real servers do not use the Director as their default gateway in an LVS-DR cluster. Packet 3 is sent directly back to the client computer (hence the name direct routing). Also notice that the source address is VIP1, which real server 1 took from the destination address it found in the inbound packet (packet 2).

Notice the following points about this exchange of packets:

  • The Director must receive all the inbound packets destined for the cluster.

  • The Director only receives inbound cluster communications (requests for services from client computers).

  • Real servers, the Director, and client computers can all share the same network segment.

  • The real servers use the router on the production network as their default gateway (unless you are using the LVS martian patch on your Director). If client computers will always be on the same network segment as the cluster nodes, you do not need to configure a default gateway for the real servers.[4]
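The three packets can be summarized as a small model in which each packet is written as "source-IP destination-IP destination-MAC". The sketch below uses the symbolic names from the figures (CIP1, VIP1); the MAC labels and function names are hypothetical placeholders. It shows that the Director changes only the destination MAC address, and that the reply carries VIP1 as its source without passing back through the Director.

```shell
#!/bin/sh
# dr_forward SRC_IP DST_IP OLD_DST_MAC REAL_SERVER_MAC
# Models what an LVS-DR Director does to an inbound packet: the IP
# addresses are left alone and only the destination MAC is replaced.
dr_forward() {
    src_ip=$1 dst_ip=$2 new_mac=$4
    echo "$src_ip $dst_ip $new_mac"
}

# dr_reply SRC_IP DST_IP NEXT_HOP_MAC
# Models the real server's reply to the packet it received: source and
# destination IP are swapped, so the reply comes from the VIP and is sent
# toward the client (or the router leading to it), not to the Director.
dr_reply() {
    echo "$2 $1 $3"
}

packet1="CIP1 VIP1 MAC_DIRECTOR"
packet2=$(dr_forward $packet1 MAC_RS1)
packet3=$(dr_reply CIP1 VIP1 MAC_CLIENT)

echo "packet 1: $packet1"
echo "packet 2: $packet2"
echo "packet 3: $packet3"
```

Comparing packet 1 with packet 2 shows why the real server must hold VIP1 on its loopback device: the packet it receives is still addressed to VIP1.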

[2]We are ignoring the lower-level TCP connection request (the TCP handshake) in this discussion for the sake of simplicity.

[3]When the kernel holds a packet in memory, it places the packet into an area of memory that is referenced by a pointer called a socket buffer, or sk_buff. To be completely accurate in this discussion, I should use the term sk_buff instead of packet every time I mention a packet inside the Director.

[4]This, however, would be an unusual configuration, because real servers will likely need to access both an email server and a DNS server residing on a different network segment.

ARP Broadcasts and the LVS-DR Cluster

As we've just seen, placing VIP addresses on the loopback (lo) device on each cluster node allows the cluster nodes in an LVS-DR cluster to accept packets that are destined for the VIP address. However, this has one dangerous side effect: the real servers inside the cluster will try to reply to ARP broadcasts from client computers that are looking for the VIP. Unless special precautions are taken, the real servers will claim to own the VIP address, and client computers will send their packets directly to real servers, thus circumventing the cluster load-balancing method and destroying the integrity of network communication with the Director (where packets that use the VIP as their destination address are supposed to go).

To understand this problem (called "The ARP Problem" in the LVS-HOWTO), let's look at how a client computer uses the VIP address to find the correct MAC address by using ARP.

Client Computers and ARP Broadcasts

Figure 13-4 shows a client computer sending an ARP broadcast to an LVS-DR cluster. Notice that because the Director and the cluster node (real server 1) are connected to the same network, they will both receive the ARP broadcast asking, "Who owns VIP1?"

Figure 13-4: An ARP broadcast to an LVS-DR cluster

In Figure 13-4, gray arrows represent the path taken by an ARP broadcast sent from the client computer. The ARP broadcast packet is sent to all nodes connected to the local network (the VLAN or physical network segment), so a gray arrow is shown on the physical wires that connect the Director and real server 1 to the network switch. This is normal network behavior.

However, we want real server 1 to ignore this ARP request and only the LVS-DR Director to respond to it, as shown in Figure 13-5. In the figure, a gray arrow depicts the path of the ARP reply. It should only come from the Director and not real server 1.

Figure 13-5: An ARP response from the LVS-DR Director

To prevent real servers from replying to ARP broadcasts for the LVS-DR cluster VIP, we need to hide the loopback interface on all of the real servers. Several techniques are available to accomplish this, and they are described in the LVS-HOWTO.


Starting with kernel version 2.4.26, the stock Linux kernel contains the code necessary to prevent real servers from replying to ARP broadcasts. This is discussed in Chapter 15.

In Conclusion

We've examined the LVS-DR forwarding method in detail in this chapter, and we examined a sample LVS-DR network conversation between a client computer and a cluster node. We've also briefly described a potential problem with ARP broadcasts when you build an LVS-DR cluster.

In Chapter 15, we will see how to build a high-availability, enterprise-class LVS-DR cluster. Before we do so, though, Chapter 14 will look inside the load balancer.

Chapter 14: The Load Balancer

In this chapter, we take a closer look at what happens when the Director receives a packet destined for a real server (a cluster node). This will lead us into a discussion of LVS persistence—the technique LVS uses to assign the same client computer to a particular real server. We'll also describe how packets can be selected with the iptables utility for processing by LVS using a technique called packet marking.

LVS and Netfilter

In Figure 14-1, the five Netfilter hooks that were introduced in Chapter 2 are shown. Superimposed on top of these hooks is a series of small black boxes representing packets passing through the kernel. The kernel places each packet it receives into a memory structure called a socket buffer, or sk_buff for short. Each of the little black boxes in Figure 14-1 is thus representing an sk_buff inside the kernel, but in this discussion we'll continue to call them packets. The gray arrows in the figure represent the path that all incoming LVS packets (packets from client computers) take as they pass through the Director on their way to a real server (cluster node).

Figure 14-1: Incoming packets inside the Director

Let's begin by looking at the five Netfilter hooks introduced in Chapter 2. Figure 14-1 shows these five hooks in the kernel.

Notice in Figure 14-1 that incoming LVS packets hit only three of the five Netfilter hooks: PRE_ROUTING, LOCAL_IN, and POST_ROUTING.[1] Later in this chapter, we'll discuss the significance of these Netfilter hooks as they relate to your ability to control the fate of a packet on the Director. For the moment, we want to focus on the path that incoming packets take as they pass through the Director on their way to a cluster node, as represented by the two gray arrows in Figure 14-1.

The first gray arrow in Figure 14-1 represents a packet passing from the PRE_ROUTING hook to the LOCAL_IN hook inside the kernel. Every packet received by the Director that is destined for a cluster service regardless of the LVS forwarding method you've chosen[2] must pass from the PRE_ROUTING hook to the LOCAL_IN hook inside the kernel on the Director. The packet routing rules inside the kernel on the Director, in other words, must send all packets for cluster services to the LOCAL_IN hook. This is easy to accomplish when you build a Director because all packets for cluster services that arrive on the Director will have a destination address of the virtual IP (VIP) address. Recall from our discussion of the LVS forwarding methods in the last three chapters that the VIP is an IP alias or secondary IP address that is owned by the Director. Because the VIP is a local IP address owned by the Director, the routing table inside the kernel on the Director will always try to deliver packets locally. Packets received by the Director that have the VIP as a destination address will therefore always hit the LOCAL_IN hook.

The second gray arrow in Figure 14-1 represents the path taken by the incoming packets after the kernel has recognized that a packet is a request for a virtual service. When a packet hits the LOCAL_IN hook, the LVS software running inside the kernel knows the packet is a request for a cluster service (called a virtual service in LVS terminology) because the packet is destined for a VIP address. When you build your cluster, you use the ipvsadm utility to add virtual service VIP addresses to the kernel so LVS can recognize incoming packets sent to the VIP in the LOCAL_IN hook. If you had not already added the VIP to the LVS virtual service table (also known as the IPVS table), the packets destined for the VIP address would be delivered to the locally running daemons on the Director. But because LVS knows the VIP address,[3] it can check each packet as it hits the LOCAL_IN hook to decide if the packet is a request for a cluster service. LVS can then alter the fate of the packet before it reaches the locally running daemons on the Director. LVS does this by telling the kernel that it should not deliver a packet destined for the VIP address locally, but should send the packet to a cluster node instead. This causes the packet to be sent to the POST_ROUTING hook as depicted by the second gray arrow in Figure 14-1. The packets are sent out the NIC connected to the D/RIP network.[4]

The power of the LVS Director to alter the fate of packets is therefore made possible by the Netfilter hooks and the IPVS table you construct with the ipvsadm utility. The Director's ability to alter the fate of network packets makes it possible to distribute incoming requests for cluster services across multiple real servers (cluster nodes), but to do this, the Director must keep track of which real servers have been assigned to each client computer so the client computer will always talk to the same real server.[5] The Director does this by maintaining a table in memory called the connection tracking table.


As we'll see later in this chapter, there is a difference between making sure all requests for services (all new connections) go to the same cluster node and making sure all packets for an established connection return to the same cluster node. The former is called persistence and the latter is handled by connection tracking.

[1]LVS also uses the IP_FORWARD hook to recognize LVS-NAT reply packets (packets sent from the cluster nodes to client computers).

[2]LVS forwarding methods were introduced in Chapter 11.

[3]Or VIP addresses.

[4]The network that connects the real servers to the Director.

[5]Throughout the IP network conversation or session for a particular cluster service.

The Director's Connection Tracking Table

The Director's connection tracking table (also sometimes called the IPVS connection tracking table, or the hash table) contains 128 bytes for each new connection from a client computer in order to store just enough information to return packets to the same real server when the client computer sends in another network packet during the same network connection.
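To get a feel for what 128 bytes per connection means in practice, here is a minimal sketch that multiplies a connection count by that figure. The function name is made up for illustration, and the real per-record size may differ between kernel versions.

```shell
# Rough memory estimate for the connection tracking table.
# Assumes the 128 bytes per connection quoted in the text.
conn_table_bytes() {
    echo $(( $1 * 128 ))
}

# Example: one million tracked connections, in bytes and whole MiB.
bytes=$(conn_table_bytes 1000000)
echo "$bytes bytes (about $(( bytes / 1024 / 1024 )) MiB)"
```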

Hash Table Structure

The tracking table consists of both rows and columns. Each row is called a hash bucket, and each column is a connection tracking record. Each record in the connection tracking table contains timer information; the protocol used by the packet; the client's IP address or CIP and the client's port number; the virtual IP address or VIP and the VIP port number; and some additional control information. Each row, or hash bucket, can contain an unlimited number of these records.

Because the Director looks for a match in this table every time it receives a packet, table lookups need to be as fast as possible. LVS speeds up table lookups by using a hashing technique to determine which row it should search first. Ideally, then, the table should have a very small number of connection records stored in each bucket (each row), because it takes less time for the hash function to locate the correct bucket than it would to run a sequential search through a large number of records within the bucket. The LVS programmers recommend 16 records per row and no more than 20.
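The idea of hashing to a bucket first and only then searching within it can be sketched as follows. The hash below is a toy chosen for readability; it is not the function LVS uses, and the function name and arguments are assumptions made for this illustration.

```shell
# Toy bucket selection: NOT the hash LVS uses, only an illustration.
# Arguments: protocol number, client IP, client port, table size in bits.
bucket_index() {
    proto=$1 cip=$2 cport=$3 bits=$4
    old_ifs=$IFS
    IFS=.
    set -- $cip                     # split the dotted quad into $1..$4
    IFS=$old_ifs
    h=0
    for x in "$proto" "$1" "$2" "$3" "$4" "$cport"; do
        h=$(( (h * 31 + x) & 0xFFFFFFFF ))
    done
    # Keep only the low bits so the result is a valid row number.
    echo $(( h & ((1 << bits) - 1) ))
}

# Example: a TCP (protocol 6) connection from a documentation-range address.
bucket_index 6 192.0.2.10 40000 12
```

The same connection always maps to the same row, so only the records in that row have to be compared one by one.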

Controlling the Hash Buckets

You control the number of buckets (rows) in the LVS hash table with the kernel option "IP masquerading VS table size" in kernel 2.2, and with "IPVS connection table size" in 2.4 and later. The number two is raised to the power of the value you enter for this kernel parameter; by default, LVS will create 2^12 or 4,096 hash buckets in the LVS hash table. This is the number of hash buckets the Director will use to speed connection record lookups, not the maximum number of simultaneous connections your Director will support. (The number of connection records the table will hold is limited only by the amount of available memory on the Director.)
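The relationship between the kernel parameter, the number of buckets, and the average number of records per bucket can be checked with a little arithmetic. This is a sketch with hypothetical function names; it assumes connections spread evenly across buckets, which a real hash only approximates.

```shell
# Number of hash buckets for a given table-size parameter (two to that power).
buckets_for_bits() {
    echo $(( 1 << $1 ))
}

# Average records per bucket for a connection count and table-size parameter.
avg_records_per_bucket() {
    echo $(( $1 / $(buckets_for_bits "$2") ))
}

buckets_for_bits 12                 # the default table size
avg_records_per_bucket 65536 12     # 65,536 connections in the default table
```

With the default table size, roughly 65,000 simultaneous connections already reach the 16 records per row mentioned earlier, which is one way to judge whether a larger table is worth configuring.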


One client may have multiple connection tracking table entries if it is accessing cluster resources on different ports (each connection is one connection tracking record in the connection tracking table).

Viewing the Connection Tracking Table

In the 2.4 and later series kernel, you can view the contents of the connection tracking table with the command:[6]

 #ipvsadm -lcn

The size of the connection tracking table is displayed when you run the ipvsadm command:

 IP Virtual Server version 0.8.2 (size=4096)

This first line of output from ipvsadm shows that the size of the connection tracking table is 4,096 hash buckets (the default).

[6]In the 2.2 series kernel, you can view the contents of this table with the command #netstat -Mn.

Timeout Values for Connection Tracking Records

When network communication between a client computer and a real server (cluster node) is no longer active, the Director sets a timer in the connection tracking record so that the record will expire. In other words, for services like telnet that use the TCP protocol, the Director will hold the connection tracking record in memory as long as the TCP connection is in an ESTABLISHED state and packets are received from the client computer; when the TCP connection drops, the connection tracking record timeout value is set so the record will eventually be removed from the connection tracking table.

LVS also uses a larger connection tracking timeout value for all connection tracking records in the hash table so that it can remove connections that remain unused for a long period of time.


The kernel also has TCP session timeout values, but they are much larger than the values imposed by LVS. For example, the ESTABLISHED TCP connection timeout value is five days in the 2.4 kernel. See the kernel source code file ip_conntrack_proto_tcp.c for a complete list and the default values used by the kernel.

LVS has three important timeout values for expiring connection tracking records:

  • A timeout value for idle TCP sessions.

  • A timeout value for TCP sessions after the client computer has closed the connection (a FIN packet was received from the client computer).[7]

  • A timeout value for UDP packets. Because UDP is a connectionless protocol, the LVS Director expires UDP connection tracking records if another packet from the client computer is not received within an arbitrary timeout period.


To see the default values for these timers on a 2.4 series kernel, look at the contents of the timeout_* files in the /proc/sys/net/ipv4/vs/ directory.[8] As of this writing, these values are not implemented in the 2.6 kernel, but they will be replaced with setsockopt controls under ipvsadm's control. See the latest version of the ipvsadm man page for details.
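One way to look at all of these files at once is a short loop. This is a sketch only: the function name is made up, the directory is the one named above, and it may not exist at all on kernels where these values are not exposed.

```shell
# Print every timeout_* file in an IPVS sysctl directory as "name = value".
# Defaults to the directory named in the text; pass another path to override.
show_vs_timeouts() {
    dir=${1:-/proc/sys/net/ipv4/vs}
    for f in "$dir"/timeout_*; do
        [ -r "$f" ] || continue     # skip when the glob matched nothing
        printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
}

show_vs_timeouts
```

If the directory is missing, the function simply prints nothing.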

These three timeout values can be modified by specifying the number of seconds to use for each timer using the --set argument of the ipvsadm command. All three values must be specified when you use the --set argument, so the command:

 #ipvsadm --set 28800 30 36000

sets the connection tracking record timeout values to: 8 hours (28,800 seconds) for established but idle TCP sessions, 30 seconds for TCP sessions after a FIN packet is received, and 10 hours (36,000 seconds) for each UDP packet.[9]
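Because --set takes plain seconds, it is easy to misjudge a value by a factor of 60. A small helper (hypothetical name, shown only as a convenience) converts seconds into hours, minutes, and seconds so the three arguments can be checked before they are applied.

```shell
# Convert a number of seconds into an "Xh Ym Zs" string.
describe_seconds() {
    s=$1
    echo "$(( s / 3600 ))h $(( s % 3600 / 60 ))m $(( s % 60 ))s"
}

# The three values passed to --set in the command above.
describe_seconds 28800
describe_seconds 30
describe_seconds 36000
```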

To implement these timeouts, LVS uses two tables: a connection timeout table called the ip_vs_timeout_table and connection state table called tcp_states. When you use ipvsadm to modify timeout values as shown in the previous example, the changes are applied to all current and future connections tracked by LVS inside the Director's kernel using these two tables.


If you set /proc/sys/net/ipv4/vs/secure_tcp to a nonzero value, LVS uses a different pair of tables to implement timeouts called vs_timeout_table_dos and vs_tcp_states_dos. You must, therefore, reissue the ipvsadm command to set timeout values whenever you enable or disable secure_tcp.

We will discuss another timer that the Director uses to return a client computer's request for service to the same real server, called the persistence timeout, shortly. But first, let's look at how the Director handles packets going in the other direction: from the cluster node to the client computer.

[7]For a discussion of TCP states and the FIN packet, see RFC 793.

[8]The values in this directory are only used on 2.4 kernels and only if /proc/sys/net/ipv4/vs/secure_tcp is nonzero. Additional timers and variables that you find in this directory are documented on the sysctl page at the LVS website. These sysctl variables are normally only modified to improve security when building a public web cluster susceptible to a DoS attack.

[9]A value of 0 indicates that the default value should be used (it does not represent infinity).

Return Packets and the Netfilter Hooks

So far in this chapter, we have only discussed incoming packets for virtual services and how these packets pass through the kernel on the Director. When you build an LVS-NAT cluster, however, packets will also pass back through the Director in the opposite direction as real servers reply to the client computers. Recall from Chapter 12 that these reply packets must pass back through the Director in an LVS-NAT cluster because the Director needs to perform the Network Address Translation to convert the source IP address in the packets from the real server's RIP to the Director's VIP.

The source IP address Network Address Translation for reply packets from the real servers is made possible on the Director thanks to the fact that LVS is inserted into the FORWARD hook. As packets "walk" back through the Netfilter hooks on their way back through the Director to the client computer, LVS can look them up in its connection tracking table to find out which VIP address it should use to replace the RIP address inside the packet header (again, this is only for LVS-NAT).

Figure 14-2 shows the return path of packets as they pass back through the Director on their way to a client computer when LVS-NAT is used as the forwarding method.

Figure 14-2: Outgoing LVS-NAT packets inside the Director

Notice in Figure 14-2 that the NIC depicted on the left side of the diagram is now the eth1 NIC that is connected to the D/RIP network (the network the real servers and Director are connected to). This diagram, in other words, is showing packets going in a direction opposite to the direction of the packets depicted in Figure 14-1. The kernel follows the same rules for processing these inbound reply packets coming from the real servers on its eth1 NIC as it did when it processed the inbound packets from client computers on its eth0 NIC. However, in this case, the destination address of the packet does not match any of the routing rules in the kernel that would cause the packet to be delivered locally (the destination address of the packet is the client computer's IP address or CIP). The kernel therefore knows that this packet should be sent out to the network, so the packet hits the FORWARD hook, shown by the first gray arrow in Figure 14-2.

This causes the LVS code[10] to demasquerade the packet by replacing the source address in the packet (sk_buff) header with the VIP address. The second gray arrow shows what happens next: the packet hits the POST_ROUTING hook before it is finally sent out the eth0 NIC. (See Chapter 12 for more on LVS-NAT.)

Regardless of which forwarding method you use, the LVS hooks in the Netfilter code allow you to control how real servers are assigned to client computers when new requests come in, even when multiple requests for cluster services come from a single client computer. This brings us to the topic of LVS persistence. But first, let's have a look at LVS without persistence.

[10]Called ip_vs_out.

LVS without Persistence

LVS clusters are normally built without using persistence. The general rule for a Linux Enterprise Cluster is that each new connection to the cluster is sent to the least loaded real server (the cluster node with the fewest number of active connections) in order to achieve the best possible distribution of the workload.[11] To achieve this, you create an IP virtual server table and specify each port separately with commands like these (for LVS-NAT):

 /sbin/ipvsadm -A -t -s wrr
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -a -t -r -m

The first command creates the LVS virtual service for port 23 (the telnet port) on the VIP. The next two lines specify which real servers are available to service requests that come in on port 23 for this VIP. (Recall from Chapter 12 that the -m option tells the Director to use the LVS-NAT forwarding method, though any LVS forwarding method would work.)

As the Director receives each new network connection request, it selects a real server using the weighted round robin (wrr) scheduling method, which rotates through the real servers in proportion to their weights. (A least-connection scheduler such as wlc would instead pick the real server with the fewest connections.) As a result, if a client computer opens three telnet sessions to the cluster, it may be assigned to three different real servers.
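Weighted round robin can be pictured as a rotation in which each real server appears as many times as its weight. The sketch below is a deliberate simplification with hypothetical server names; the real LVS wrr scheduler interleaves servers differently, and this version assumes at least one server has a nonzero weight.

```shell
# Simplified weighted round robin.
# $1 = zero-based number of the incoming connection
# remaining arguments = name:weight pairs (hypothetical server names)
wrr_pick() {
    n=$1; shift
    rotation=""
    for pair in "$@"; do
        name=${pair%%:*}
        weight=${pair##*:}
        i=0
        while [ "$i" -lt "$weight" ]; do
            rotation="$rotation $name"   # one slot per unit of weight
            i=$(( i + 1 ))
        done
    done
    set -- $rotation
    shift $(( n % $# ))                  # walk to this connection's slot
    echo "$1"
}

# Three consecutive connections from the same client.
wrr_pick 0 rs1:2 rs2:1
wrr_pick 1 rs1:2 rs2:1
wrr_pick 2 rs1:2 rs2:1
```

Nothing in the selection depends on which client is connecting, which is why consecutive sessions from one client can land on different real servers.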

This may not be desirable, however. If, for example, you are using a licensing scheme such as the Flexlm floating license[12] (which can cause a single user to use multiple licenses when they run several instances of the same application on multiple real servers), you may want all of the connections from a client computer to go to the same real server regardless of how that affects load balancing.

Keep in mind, however, that using LVS persistence can lead to load-balancing problems when many clients appear to come from a single IP address (they may all be behind a single NAT firewall), or if they are, in fact, all coming from a single IP address (all users may be logged on to one terminal server or one thin client server[13]). In such a case, all of the client computers will be assigned to a single real server and possibly overwhelm it even though other nodes in the cluster are available to handle the workload. For this reason, you should avoid using LVS persistence if you are using Thin Clients or a terminal server.

In the next few sections, we'll examine LVS persistence more closely and discuss when to consider using it in a Linux Enterprise Cluster.

[11]Here I am making a general correlation between number of connections and workload.

[12]Flexlm has other more cluster-friendly license schemes such as a per-user license scheme.

[13]See the Linux Terminal Server Project at

LVS Persistence

Regardless of the LVS forwarding method you choose, if you need to make sure all of the connections from a client return to the same real server, you need LVS persistence. For example, you may want to use LVS persistence to avoid wasting licenses if one user (one client computer) needs to run multiple instances of the same application.[14] Persistence is also often desirable with SSL because the key exchange process required to establish an SSL connection will only need to be done one time when you enable persistence.

Persistent Connection Template

When using LVS persistence, the Director internally uses a connection tracking record called a persistent connection template to ensure that all connections from the client computer are assigned to the same real server. As the client computer makes connection requests to the cluster, the Director creates a normal connection tracking record for each connection, but it does so only after it looks at the persistent connection template record and decides which real server has already been assigned to this type of connection. What do I mean by type of connection? I'll discuss that in a moment, but first let me explain how the Director removes persistent connection templates based on a timeout value you specify.

Persistence Timeout

Use the ipvsadm utility to specify a timeout value for the persistent connection template on the Director. I'll show you which ipvsadm commands to use to set the persistence timeout for each type of connection in a moment when I get to the discussion of the types of LVS persistence. The timeout value you specify is used to set a timer for each persistent connection template as it is created. The timer will count down to zero whether or not the connection is active, and can be viewed[15] with ipvsadm -L -c.

If the counter reaches zero and the connection is still active (the client computer is still communicating with the real server), the counter will reset to a default value[16] of two minutes regardless of the persistent timeout value you specified and will then begin counting down to zero again. The counter is then reset to the default timeout value each time the counter reaches 0 as long as the connection remains active.
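The countdown-and-reset behavior described above can be modeled in a few lines. This is a simplified model of the description in the text, not LVS code; the function name and arguments are assumptions, and the reset value defaults to the two minutes mentioned above.

```shell
# Model of the persistent connection template timer.
# $1 = persistence timeout in seconds
# $2 = time in seconds at which the client's connection stops being active
# $3 = reset value in seconds (defaults to two minutes)
template_expiry() {
    t=$1 active_until=$2 reset=${3:-120}
    while [ "$t" -lt "$active_until" ]; do
        t=$(( t + reset ))      # timer reached zero while still active: restart
    done
    echo "$t"                   # moment at which the template finally expires
}

template_expiry 3600 100    # connection ends long before the timer runs out
template_expiry 300 400     # connection outlives the timer, so it is reset once
```

In the second case the template lives until 420 seconds (300 plus one two-minute reset), longer than the persistence timeout that was requested.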


You can also use the ipvsadm utility to specify a TCP session timeout value that may be larger than the persistent timeout value you specified. A large TCP session timeout value will therefore also increase the amount of time a connection template entry remains on the Director, which may be greater than the persistent timeout value you specify.

Types of Persistent Connections

Now that I've introduced you to the method LVS uses to expire unused persistent connection template records, let's examine the five types of persistent connections. They are:

  1. Persistent client connections (PCC), which cause all services a client is accessing to persist. (Also called zero port connections.)

  2. Persistent port connections (PPC), which cause a single service to persist.

  3. Netfilter Marked Packet persistence, which causes packets that have been marked with the ipchains/iptables utility to persist.

  4. FTP connections (FTP connections require careful handling due to the complex[17] nature of FTP connections).

  5. Expired persistence, which is used internally by the Director to expire connection tracking entries when the persistent connection template expires.[18]


If you are building a web cluster, you may need to set the persistence granularity that LVS should use to group CIPs. Normally, each CIP is treated as a unique address when LVS looks up records in the persistent connection template. You can, however, group CIPs using a network mask (see the -M option on the ipvsadm man page and the LVS HOWTO for details).
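To see what grouping CIPs by a network mask does, the sketch below reduces a client address to its network address; clients that reduce to the same value would share one persistent connection template. The function name is hypothetical, the addresses come from a documentation range, and a prefix length is used here only for brevity (check the ipvsadm man page for the form the -M option expects).

```shell
# Reduce a client IP to its persistence group for a given prefix length.
cip_group() {
    bits=$2
    old_ifs=$IFS; IFS=.; set -- $1; IFS=$old_ifs
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    net=$(( ip & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}

# Two clients behind the same /24 fall into the same group.
cip_group 192.0.2.77 24
cip_group 192.0.2.200 24
```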

We are most interested in PPC, PCC, and Netfilter Marked Packet persistence.

Persistent Client Connection (PCC)

A persistent client connection (PCC) forces all connections from a client computer to a single real server. A PCC is simply a virtual service created with no port number (or port number 0) and with the -p flag set. You would use a persistent client connection when you want all of the connections from a client computer to go to the same real server. If a customer adds items to a shopping cart on your web cluster using the HTTP protocol and then clicks the checkout button to use the encrypted HTTPS protocol, you want to use a persistent client connection so that both port 80 (HTTP) and port 443 (HTTPS) will go to the same real server inside the cluster. For example:

 /sbin/ipvsadm -A -t -s rr -p
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -a -t -r -m

These three lines create a PCC virtual service on the VIP address, using two real servers.

The default timeout value of 360 seconds can be modified by supplying the -p option with the number of seconds that the persistent connection template should remain on the Director. For example, to create a one-hour PCC virtual service, use:

 /sbin/ipvsadm -A -t -s rr -p 3600
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -a -t -r -m

Persistent Port Connection (PPC)

A persistent port connection (PPC) forces all connections from a client computer for a particular destination port number to the same real server. For example, let's say you want to allow a user to create multiple telnet sessions,[19] and you would like all of the telnet sessions to go to the same real server; however, when the user calls up a web page, you'd like to assign this request to any node, regardless of which real server they are using for telnet. In this case, you could use persistent port connections for the telnet port (port 23) and the HTTP port (port 80), as follows:

 /sbin/ipvsadm -A -t -s rr -p 3600
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -A -t -s rr -p 3600
 /sbin/ipvsadm -a -t -r -m
 /sbin/ipvsadm -a -t -r -m

Port Affinity

The key difference between PCC and PPC persistence is sometimes called port affinity. With PCC persistence, all connections to the cluster from a single client computer end up on the same real server. Thus, a client computer talking on port 80 (using HTTP) will connect to the same real server when it needs to use port 443 (for HTTPS). When you use PCC, ports 80 and 443 are therefore said to have an affinity with each other.

On the other hand, when you create an IPVS table with multiple ports using PPC, the Director creates one connection tracking template record for each port the client uses to talk to the cluster so that one client computer may be assigned to multiple real servers (one for each port used). With PPC persistence, the ports do not have an affinity with each other.

In fact, when using PCC persistence, all ports have an affinity with each other. This may increase the chance of an imbalanced cluster load if client computers need to connect to several port numbers to use the cluster. To create port affinity that only applies to a specific set of ports (port 80 and port 443 for HTTP and HTTPS communication, for example), use Netfilter Marked Packets.
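Port affinity comes down to what the Director uses as the lookup key for a persistent connection template. The sketch below uses a made-up key format purely to show which connections end up sharing a template under each type of persistence; it does not reflect how LVS stores templates internally.

```shell
# Hypothetical lookup key for a persistent connection template.
# mode: pcc (port ignored), ppc (one key per port), fwm (one key per mark)
template_key() {
    mode=$1 cip=$2 port=$3 mark=${4:-}
    case $mode in
        pcc) echo "$cip:0" ;;
        ppc) echo "$cip:$port" ;;
        fwm) echo "$cip:fwm$mark" ;;
        *)   echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

template_key pcc 192.0.2.10 80        # same key as port 443: affinity
template_key pcc 192.0.2.10 443
template_key ppc 192.0.2.10 80        # different key from port 443: no affinity
template_key ppc 192.0.2.10 443
template_key fwm 192.0.2.10 80 99     # same key for every port given mark 99
template_key fwm 192.0.2.10 443 99
```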

Netfilter Marked Packets

Netfilter Marked Packets (formerly called fwmarked packets) have been marked by either the ipchains or the iptables utility on the Director. The Netfilter mark only affects the packet while it is on the Director; once the packet leaves the Director, it is no longer marked (real servers can't see the marks made on the Director).

When you specify the criteria for marking a packet (using iptables or ipchains), you assign a mark number. This number is then associated with an IP virtual service (using the ipvsadm utility), so that the marked packets will be sent to the proper real server.

The Netfilter mark number is placed in the socket buffer (called sk_buff), not in the packet header, and is only associated with the packet while it is being processed by the kernel on the Director.

Your iptables rules, which run in the PRE_ROUTING hook, cause the Netfilter mark number to be placed inside the packet's sk_buff entry before the packet passes through the routing process. Once the packet has passed through the routing process and the kernel has decided that it should be delivered locally, it reaches the LOCAL_IN hook. At this point, the LVS code sees the packet and can check the Netfilter mark number to determine which IP virtual server (IPVS) to use, as shown in Figure 14-3.

Figure 14-3: Netfilter Marked Packets and LVS

Notice in Figure 14-3 that the Netfilter mark is attached to the incoming packet's sk_buff (shown as a white square inside of the black box representing packets) in the PRE_ROUTING hook. The Netfilter mark can then be read by the LVS code that searches for IPVS packets in the LOCAL_IN hook. Normally, only packets destined for a VIP address that has been configured as a virtual service by the ipvsadm utility are selected by the LVS code, but if you are using Netfilter to mark packets as shown in this diagram, you can build a virtual service that selects packets based on the mark number.[20] LVS can then send these packets back out a NIC (shown as eth1 in this example) for processing on a real server.

Marking Packets with iptables

For example, say we want to create one-hour persistent port affinity between ports 80 and 443 on the VIP for two real servers. We'd use these iptables and ipvsadm commands:

 /sbin/iptables -F -t mangle
 /sbin/iptables -A PREROUTING -i eth0 -t mangle -p tcp \
     -d --dport 80 -j MARK --set-mark 99
 /sbin/iptables -A PREROUTING -i eth0 -t mangle -p tcp \
     -d --dport 443 -j MARK --set-mark 99
 /sbin/ipvsadm -A -f 99 -s rr -p 3600
 /sbin/ipvsadm -a -f 99 -r -m
 /sbin/ipvsadm -a -f 99 -r -m

The \ character in the above commands means the command continues on the next line.

The command in the first line flushes all of the rules out of the iptables mangle table. This is the table that allows you to insert your own rules into the PRE_ROUTING hook.

The next two commands contain the criteria we want to use to select packets for marking. We're selecting packets that are destined for our VIP address; a /24 suffix on the address indicates that the first 24 bits are the netmask. The balance of each command says to mark packets that are trying to reach port 80 or port 443 with the Netfilter mark number 99.

The commands on the last three lines use ipvsadm to send these packets using round robin scheduling and LVS-NAT forwarding to the two real servers.

To view the iptables we have created, you would enter:

 #iptables -L -t mangle -n
 target      prot opt source     destination
 MARK        tcp  --    tcp dpt:80 MARK set 0x63
 MARK        tcp  --    tcp dpt:443 MARK set 0x63
 Chain OUTPUT (policy ACCEPT)
 target      prot opt source     destination

Recall from Chapter 2 that iptables uses table names instead of Netfilter hooks to simplify the administration of the kernel netfilter rules. The chain called PREROUTING in this report refers to the netfilter PRE_ROUTING hook shown in Figure 14-3. So this report shows two MARK rules that will now be applied to packets as they hit the PRE_ROUTING hook based on destination address, protocol (tcp), and destination ports (dpt) 80 and 443. The report shows that packets that match these criteria will have their MARK number set to hexadecimal 63 (0x63), which is decimal 99.
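Because iptables reports marks in hexadecimal while ipvsadm takes them in decimal, it helps to be able to convert quickly between the two. The standard printf utility is enough for this:

```shell
printf '%d\n' 0x63      # hexadecimal mark from the iptables report, as decimal
printf '0x%x\n' 99      # decimal mark used with ipvsadm -f, as hexadecimal
```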

Now, let's examine the LVS IP Virtual Server routing rules we've created with the following command:

 #ipvsadm -L -n
 IP Virtual Server version x.x.x (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port        Forward   Weight   ActiveConn    InActConn
 FWM  99 rr persistent 3600
      ->             Masq      1        0             0
      ->             Masq      1        0             0

The output of this command shows that packets carrying a Netfilter mark (FWM, for fwmark) of decimal 99 will be sent to the two real servers for processing using the Masq forwarding method (recall from Chapter 11 that Masq refers to the LVS-NAT forwarding method).

Notice the flexibility and power we have when using iptables to mark packets that we can then forward to real servers with ipvsadm forwarding and scheduling rules. You can use any criteria available to the iptables utility to mark packets when they arrive on the Director, at which point these packets can be sent via LVS scheduling and routing methods to real servers inside the cluster for processing.

The iptables utility can select packets based on the source address, destination address, or port number and a variety of other criteria. See Chapter 2 for more examples of the iptables utility. (The same rules that match packets for acceptance into the system that were provided in Chapter 2 can also match packets for marking when you use the -t mangle option and the PREROUTING chain.) Also see the iptables man page for more information or search the Web for "Rusty's Remarkably Unreliable Guides."

[14]Again, this depends on your licensing scheme.

[15]The value in the "expire" column.

[16]Controlled by the TIME_WAIT timeout value in the IPVS code. If you are using secure tcp to prevent a DoS attack as discussed earlier in this chapter, the default TIME_WAIT timeout value is one minute.

[17]In normal FTP connections, the server connects back to the client to send or receive data. FTP can also switch to a passive mode so that the client can connect to the server to send or receive data.

[18]To avoid the time-consuming process of searching through the connection tracking table for all of the connections that are no longer valid and removing them (because a persistent connection template cannot be removed until after the connection tracking records it is associated with expire), LVS simply enters 65535 for the port numbers in the persistent connection template entry and allows the connection tracking records to expire normally (which will ultimately cause the persistent connection template entry to expire).

[19]From one client computer.

[20]Regardless of the destination address in the packet header.

In Conclusion

You don't have to understand exactly how the LVS code hooks into the Linux kernel to build a cluster load balancer, but in this chapter I've provided you with a look behind the scenes at what is going on inside of the Director so you'll know at what points you can alter the fate of a packet as it passes through the Director. To balance the load of incoming requests across the cluster using an LVS Director, however, you will need to understand how LVS implements persistence and when to use it, and when not to use it.

Chapter 15: The High-Availability Cluster


In this chapter, we'll combine the Heartbeat package with the LVS software to build a high-availability LVS-DR cluster with no single point of failure that will serve as the foundation for our enterprise-class cluster. (We'll add a filesystem, a method of scheduling batch jobs, and application monitoring software in later chapters.)

In order to build a high-availability cluster we need to use redundant LVS Directors, and we must be able to automatically remove real servers (cluster nodes) when they fail. Before we discuss all of the design goals of a high-availability cluster, we'll examine these two basic requirements.


See the Keepalived project for another approach to building a high-availability cluster.

Redundant LVS Directors

A high-availability LVS cluster has two LVS Directors: a primary and a backup. When the two Directors first boot, the primary Director owns the cluster load-balancing resource (the VIP and the LVS forwarding and scheduling rules). The secondary (backup) Director listens to the heartbeats coming from the primary Director. If the secondary Director does not hear the primary server's heartbeat, it initiates a failover and takes ownership of the cluster load-balancing resource by doing the following:

  • Adding the VIP address to one of its NICs.

  • Sending GARP broadcasts to let other machines on the network know that it now has the VIP.

  • Creating an IPVS table to load balance incoming requests for (virtual) services.

  • Possibly shutting off the power (Stonith) to the primary Director.

High-Availability Cluster Design Goals

To be highly available with no single point of failure, our cluster must handle these situations in the following ways:

  • If the Primary Director no longer responds to requests for cluster resources on its public network interface, the secondary Director should take over and shut off or reset the power to the Primary Director (to avoid a split-brain condition).

  • If one of the real servers (cluster nodes) inside the cluster no longer responds to requests for its services, the real server should be removed from the cluster.

  • If the Director is operating in LocalNode[1] mode and a locally running daemon stops responding, the Director should stop routing requests for cluster resources locally and forward all requests to real servers inside the cluster instead.[2]

A highly available cluster should also:

  • Use separate physical switches for the primary and secondary Director in order to avoid a single point of failure with the network hardware.

  • Use a highly available NAS server in order to avoid a single point of failure with the shared file system. (See Chapter 16 for a discussion of a shared storage technique that addresses lock arbitration problems.)[3]


    Many LVS clusters will not need to use NAS for shared storage. See Chapter 20 for a discussion of SQL database servers and a brief overview of the Zope project—two examples of sharing data that do not use NAS.

  • Use a hardened Linux distribution such as EnGarde or LIDS in order to protect the real servers when the real servers are connected to the Internet.


See "LVS defense strategies against DoS attack" on the LVS website for ways to drop inbound requests in order to avoid overwhelming an LVS Director when it is under attack. Building a cluster that uses these techniques is outside the scope of this book.

[1]LocalNode was introduced in Chapter 12.

[2]Although this is normally a very unlikely scenario.

[3]Your NAS vendor will have a redundancy solution for this problem as well.

The High-Availability LVS-DR Cluster

In this chapter, we will describe how to build a Linux Enterprise Cluster that uses the LVS-DR forwarding method in conjunction with the Heartbeat package for high availability.

A diagram of a highly available LVS cluster with two Directors is shown in Figure 15-1. The Directors, real servers, and client computers are all connected to the same network segment. The short dotted arrow connecting the LVS Director and the backup LVS Director in this figure represents a separate physical connection for heartbeats.

Image from book
Figure 15-1: A highly available DR cluster

Using this configuration, the heartbeats can pass over the normal Ethernet network and a dedicated network or serial cable connection between the LVS Director and the backup LVS Director. Recall from Chapter 6 that heartbeats should be able to travel over two physically independent paths between the servers in a high-availability server pair.[4]

This diagram also shows a Stonith device (as described in Chapter 9) that provides true high availability. The backup node has a serial or network connection to the Stonith device so that Heartbeat can send power reset commands to it. The primary node has its power cable plugged into the Stonith device.[5]

The dashed gray arrows that originate from the LVS Director and point to each cluster node represent incoming packets from the client computers. Because this is an LVS-DR cluster, the reply packets are sent directly back out onto the Ethernet network by each of the cluster nodes and directly back to the client computers connected to this Ethernet network.

As processing needs grow, additional cluster nodes (LVS real servers) can easily be added to this configuration,[6] though two machines outside the cluster will always act as a primary and backup Director to balance the incoming workload.

[4]Avoid the temptation to put heartbeats only on the real or public network (as a means of service monitoring), especially when using Stonith as described in this configuration. See Chapter 9 for more information.

[5]See Chapter 9 for a discussion of the limitations imposed on a high availability configuration that does not use two Stonith devices.

[6]Using the SystemImager package described in Chapter 5.

Introduction to ldirectord

To fail over the LVS load-balancing resource from the primary Director to the backup Director and automatically remove nodes from the cluster, we need to use the ldirectord program. This program automatically builds the IPVS table when it first starts up, and then it monitors the health of the cluster nodes and removes them from the IPVS table (on the Director) when they fail.

How ldirectord Monitors Cluster Nodes (LVS Real Servers)

The ldirectord daemon monitors the health of real servers by sending requests to access cluster resources on the real IP (RIP) of each real server. This is true for all LVS cluster types: LVS-DR, LVS-TUN, and LVS-NAT. Normally, one ldirectord daemon runs for each VIP address on the Director. When a real server does not reply to the ldirectord daemon running on the Director, the ldirectord daemon issues the correct ipvsadm command to remove it from the IPVS table for the VIP address. (Later, when the real server comes back online, ldirectord issues the correct ipvsadm command to add the real server back into the IPVS table.)

To monitor real servers inside a web cluster, for example, the ldirectord daemon uses the HTTP protocol to ask each real server for a specific web page. The Director knows what it should receive back from the real servers if they are healthy. If the reply string, or web page, takes too long to come back from the real server, or never comes back at all, or comes back changed in some way, ldirectord will know something is wrong, and it will remove the node from the IPVS table of valid real servers for a given VIP.

Figure 15-2 shows the ldirectord request to view a special health check web page (not normally viewed by client computers) on the first real server.

Image from book
Figure 15-2: ldirectord requests a health check URL

The URL in this figure relies on a file named .healthcheck.html on real server 1 (at its RIP). The Apache web server daemon running on real server 1 will use this file (which contains only the word OKAY) to reply to the ldirectord daemon as shown in Figure 15-3.

Image from book
Figure 15-3: Real server 1 sends back the reply

Notice in Figure 15-2 that the destination address of the packet is the RIP address of a real server (real server 1). Also notice that the return address of the packet is the Director's IP address (the DIP).

In Figure 15-3, the packet sent back from the real server to the Director's DIP is shown.

If anything other than the expected reply (the word OKAY as shown in this example) is received by the ldirectord daemon on the Director, the real server will be removed from the IPVS table.
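The decision described here can be sketched in a few lines of shell. This is a simplified illustration, not ldirectord's own code: check_reply and check_url are hypothetical helper names, curl is assumed to be installed for the network step, and the sketch uses strict string equality where ldirectord's own matching may be more permissive.

```shell
#!/bin/sh
# Sketch of a negotiate-style check. EXPECTED mirrors the "OKAY" page
# from the text; check_reply and check_url are hypothetical helper names.
EXPECTED="OKAY"

# Succeeds only when the reply body is exactly the expected word.
check_reply() {
    [ "$1" = "$EXPECTED" ]
}

# Fetch a URL ($1) with a time limit in seconds ($2) and test the reply.
# A timeout, a refused connection, or an HTTP error all leave the body
# empty, so the check fails in each of those cases.
check_url() {
    body=$(curl --silent --fail --max-time "$2" "$1" 2>/dev/null)
    check_reply "$body"
}

# Offline demonstration of the comparison step:
for reply in "OKAY" "okay" "" "<html>404 Not Found</html>"; do
    if check_reply "$reply"; then
        echo "reply '$reply': keep real server in the IPVS table"
    else
        echo "reply '$reply': remove real server from the IPVS table"
    fi
done
```

On a Director, check_url would be pointed at the health check page on a real server's RIP, with the second argument playing the role of the timeout.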

Health checks for cluster resources must be made on the RIP addresses of the real servers inside the cluster regardless of the LVS forwarding method used. In the case of the HTTP protocol, the Apache web server running on the real servers must be listening for requests on the RIP address. (Appendix F contains instructions for configuring Apache in a cluster environment.) Apache must be configured to allow requests for the health check web page or health check CGI script on the RIP address.


Normally, the Director sends all requests for cluster resources to VIP addresses on the LVS-DR real servers, but for cluster health check monitoring, the Director must use the RIP address of the LVS-DR real server and not the VIP address.

Because ldirectord is using the RIP address to monitor the real server and not the VIP address, the cluster service might not be available on the VIP address even though the health check web page is available at the RIP address.

LVS, Heartbeat, and ldirectord Recipe

In this recipe, we'll build the highly available LVS-DR cluster shown in Figure 15-1.


If you haven't already built an LVS-DR according to the instructions in Chapter 13, you should do so before you attempt to follow this recipe.

List of ingredients:

  • Kernel with LVS support and the ability to suppress ARP replies (2.4.26 or greater in the 2.4 series, 2.6.4 or greater in the 2.6 series)

  • Copy of the Heartbeat RPM package

  • Copy of the ldirectord Perl program

Hide the Loopback Interface

We need to tell the real servers to ignore ARP broadcasts from client computers that are searching for the owner (the MAC address) of the VIP (see the section "ARP Broadcasts and the LVS-DR Cluster" in Chapter 13 for an explanation of why this is required in an LVS-DR cluster).

There are two commonly used methods to accomplish this, and you can use either one: either create a script that is launched by the init program when the system boots (see Chapter 1), or modify the /etc/sysctl.conf file and issue the sysctl -p command each time the system boots.[7] The chapter15 subdirectory on the CD-ROM contains a copy of the following sample init script that will let you use the first method. The script adds the VIP address to the hidden loopback device, adds a routing table entry[8] for the VIP, enables packet forwarding, and tells the kernel to hide the VIP.


This script can be used on real servers and on the primary and backup Director because the Heartbeat program[9] knows how to remove a VIP address on the hidden loopback device when it conflicts with an IP address specified in the /etc/ha.d/haresources file. In other words, the Heartbeat program will remove the VIP from the hidden loopback device when it first starts up[10] on the primary Director, and it will remove the VIP from the hidden loopback device on the backup Director when the backup Director needs to acquire the VIP (because the primary Director has crashed).

 #!/bin/bash
 #
 # lvsdrrs init script to hide loopback interfaces on LVS-DR
 # Real servers. Modify this script to suit
 # your needs. You at least need to set the correct VIP address(es).
 #
 # Script to start LVS DR real server.
 # chkconfig: 2345 20 80
 # description: LVS DR real server
 #
 # You must set the VIP address to use here:
 VIP=""    # set this to your cluster's VIP address
 case "$1" in
     start)
         # Start LVS-DR real server on this machine.
         /sbin/ifconfig lo down
         /sbin/ifconfig lo up
         echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
         echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
         echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
         echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
         /sbin/ifconfig lo:0 $VIP netmask 255.255.255.255 up    # see note [11]
         /sbin/route add -host $VIP dev lo:0
         ;;
     stop)
         # Stop LVS-DR real server loopback device(s).
         /sbin/ifconfig lo:0 down
         echo 0 > /proc/sys/net/ipv4/conf/lo/arp_ignore
         echo 0 > /proc/sys/net/ipv4/conf/lo/arp_announce
         echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
         echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
         ;;
     status)
         # Status of LVS-DR real server.
         islothere=`/sbin/ifconfig lo:0 | grep $VIP`
         isrothere=`netstat -rn | grep "lo:0" | grep $VIP`
         if [ ! "$islothere" -o ! "$isrothere" ];then
             # Either the route or the lo:0 device
             # not found.
             echo "LVS-DR real server Stopped."
         else
             echo "LVS-DR Running."
         fi
         ;;
     *)
         # Invalid entry.
         echo "$0: Usage: $0 {start|status|stop}"
         exit 1
         ;;
 esac

Place this script in the /etc/init.d directory on all of the real servers and enable it using chkconfig, as described in Chapter 1, so that it will run each time the system boots.
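If you prefer the second method (the sysctl configuration file), the four ARP settings written by the script correspond to sysctl keys. The following is a minimal sketch, assuming the usual mapping from /proc/sys paths to dotted sysctl names; proc_to_sysctl and print_arp_sysctls are hypothetical helper names. It only prints the lines so that you can review them before adding them to the sysctl configuration file, and it covers just the ARP settings, not the lo:0 address or the host route.

```shell
#!/bin/sh
# Convert a /proc/sys path into the equivalent sysctl key.
# proc_to_sysctl is a hypothetical helper name used for this sketch.
proc_to_sysctl() {
    echo "$1" | sed -e 's|^/proc/sys/||' -e 's|/|.|g'
}

# Print configuration lines matching the echo commands in the init script.
print_arp_sysctls() {
    for iface in lo all; do
        echo "$(proc_to_sysctl /proc/sys/net/ipv4/conf/$iface/arp_ignore) = 1"
        echo "$(proc_to_sysctl /proc/sys/net/ipv4/conf/$iface/arp_announce) = 2"
    done
}

print_arp_sysctls
```

After adding lines like these, sysctl -p applies them without a reboot.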


The Linux Virtual Server website also contains an installation script that will build an init script similar to this for you automatically.

Install the Heartbeat on a Primary and a Backup Director

We will use Heartbeat to start ldirectord and bring up the VIP IP alias or secondary IP address on the primary Director. The ldirectord program and the VIP will be one resource group (see Chapter 8) under Heartbeat's control. If the primary Director crashes, Heartbeat running on the backup Director will take over this resource group and client computers will continue to be able to access the cluster.

Install ldirectord and its Required Software Components

ldirectord is a Perl program that relies on several prewritten Perl modules that help it to connect to the real servers using HTTP, POP, telnet, and so on. The following sections discuss how to install the necessary Perl modules.

Install Perl Modules Located on the CD-ROM

The chapter15/ldirectord/dependencies directory on the CD-ROM in this book includes the current Perl module dependencies for ldirectord from the Ultra Monkey website.[12] Use the following commands to extract and build each Perl module.

 #mkdir -p /usr/local/src/perldeps
 #mount /mnt/cdrom
 #cp -r /mnt/cdrom/chapter15/ldirectord/dependencies /usr/local/src/perldeps
 #cd /usr/local/src/perldeps
 #tar xzvf libnet*
 #cd libnet-<version>
 #perl Makefile.PL
 #make install

Download Perl Modules From CPAN

If you choose not to install from the CD-ROM, you can download the necessary Perl modules for free from the Comprehensive Perl Archive Network, or CPAN. Here is a list of the protocols and services that ldirectord can monitor, as well as the associated CPAN Perl modules required to monitor them.


 Protocol or Service    Perl Module
 HTTP                   LWP::UserAgent (libwww-perl)
 HTTPS                  Net::SSLeay (Net_SSLeay)
 FTP                    Net::FTP (libnet)
 LDAP                   Net::LDAP (perl-ldap)
 NNTP                   IO::Socket/IO::Select (already part of Perl)
 SMTP                   Net::SMTP (libnet)
 POP                    Net::POP3 (libnet)
 IMAP                   Mail::IMAPClient (Mail-IMAPClient)

Start the CPAN program with the commands:[13]

 #cd /
 #perl -MCPAN -e shell

If you've never run this command before, it will ask if you want to manually configure CPAN. Say no if you want to have the program use its defaults.

You should install the CPAN module MD5 to perform security checksums on the download files the first time you use this method of downloading CPAN modules. While optional, it gives Perl the ability to verify that it has correctly downloaded a CPAN module using the MD5 checksum process:

 cpan> install MD5

The MD5 download and installation should finish with a message like this:

 /usr/bin/make install -- OK

Now install the CPAN modules with the following case-sensitive commands. You will need to respond to questions about your network configuration.


You do not need to install all of these modules for ldirectord to work properly. However, installing all of them allows you to monitor these types of services later by simply making a configuration change to the ldirectord configuration file.

 cpan> install LWP::UserAgent
 cpan> install Net::SSLeay
 cpan> install Mail::IMAPClient
 cpan> install Net::LDAP
 cpan> install Net::FTP
 cpan> install Net::SMTP
 cpan> install Net::POP3

Some of these packages will ask if you want to run tests to make sure that the package installs correctly. These tests will usually involve using the protocol or service you are installing to connect to a server. Although these tests are optional, if you choose not to run them, you may need to force the installation by adding the word force at the beginning of the install command. For example:

 cpan> force install Mail::IMAPClient

Each command should complete with the following message to indicate a successful installation:

 /usr/bin/make install -- OK

ldirectord does not need to monitor every service your cluster will offer. If you are using a service that is not supported by ldirectord, you can write a CGI program that will run on each real server to perform health or status monitoring and then connect to each of these CGI programs using HTTP.

If you want to use the LVS configure script (not described in this book) to configure your Directors, you should also install the following modules:

 cpan> install Net::DNS
 cpan> install strict
 cpan> install Socket

When you are done with CPAN, enter the following command to return to your shell prompt:

 cpan> quit

The Perl modules we just downloaded should have been installed under the /usr/local/lib/perl<MAJOR-VERSION-NUMBER>/site_perl/<VERSION-NUMBER> directory. Examine this directory to see where the LWP and Net directories are located, and then run the following command.

 #perl -V

The last few lines of output from this command will show where Perl looks for modules on your system, in the @INC variable. If this variable does not point to the Net and LWP directories you downloaded from CPAN, you need to tell Perl where to locate these modules. (One easy way to update the directories Perl uses to locate modules is to simply upgrade to a newer version of Perl.)
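A quick way to confirm that Perl can actually find the modules is to try loading each one. The sketch below is illustrative: have_perl_module is a hypothetical helper name, and the module list simply repeats the install commands shown earlier. A module that reports NOT found is either not installed or installed outside the directories listed in @INC.

```shell
#!/bin/sh
# Report which of the monitoring modules Perl can load from @INC.
# have_perl_module is a hypothetical helper name.
have_perl_module() {
    perl -M"$1" -e 1 2>/dev/null
}

for mod in LWP::UserAgent Net::SSLeay Net::FTP Net::LDAP \
           Net::SMTP Net::POP3 Mail::IMAPClient; do
    if have_perl_module "$mod"; then
        echo "$mod: found"
    else
        echo "$mod: NOT found"
    fi
done

# List the directories Perl searches, one per line.
if command -v perl >/dev/null 2>&1; then
    perl -e 'print "$_\n" for @INC'
fi
```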


In rare cases, you may want to modify Perl's @INC variable. See the description of the -I option on the perlrun man page (type man perlrun), or you can also add the -I option to the ldirectord Perl script after you have installed it.

Install ldirectord

You'll find ldirectord in the chapter15 directory on the CD-ROM. To install this version enter:

 #mount /mnt/cdrom
 #cp /mnt/cdrom/chapter15/ldirectord* /etc/ha.d/resource.d/ldirectord
 #chmod 755 /usr/sbin/ldirectord

You can also download and install ldirectord from the Linux HA CVS source tree or from the ldirectord website. Be sure to place ldirectord in the /usr/sbin directory and to give it execute permission.


Remember to install the ipvsadm utility (located in the chapter12 directory) before you attempt to run ldirectord.

Test Your ldirectord Installation

To test the installation of ldirectord, ask it to display its help information with the -h switch.

 #/usr/sbin/ldirectord -h

Your screen should show the ldirectord help page. If it does not, you will probably see a message about a problem finding the necessary modules.[14] If you receive this message, and you cannot or do not want to download the Perl modules using CPAN, you can edit the ldirectord program and comment out the Perl code that calls the methods of healthcheck monitoring that you are not interested in using.[15] You'll need at least one module, though, to monitor your real servers inside the cluster.

Create the ldirectord Configuration File

ldirectord uses a configuration file to build the IPVS table. You can give this file any legal file name you wish, but you must place it in the /etc/ha.d/conf directory. For example, the configuration file for the IPVS table on one VIP can contain entries like these:

     real= gate 1 ".healthcheck.html", "OKAY"
     real= gate 1 ".healthcheck.html", "OKAY"

You must indent the lines after the virtual line with at least four spaces or the tab character.

The first four lines in this file are "global" settings; they apply to multiple virtual hosts. When used with Heartbeat, however, this file should normally contain virtual= sections for only one VIP address, as shown here. This is so because when you place each VIP in the haresources file on a separate line, you run one ldirectord daemon for each VIP and use a different configuration file for each ldirectord daemon. Each VIP and its associated IPVS table thus becomes one resource that Heartbeat can manage.
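Putting the options discussed in this section together, a complete file might look like the following sketch. Treat it as an illustration under stated assumptions: the addresses are hypothetical placeholders from the documentation address range, the file is written to a scratch location rather than to /etc/ha.d/conf, and the checktimeout and checkinterval values are chosen only for the example.

```shell
#!/bin/sh
# Write a sample ldirectord configuration to a scratch location.
# VIP, RIP, and CONF are hypothetical placeholders for this sketch.
VIP=192.0.2.1
RIP=192.0.2.11
CONF="${TMPDIR:-/tmp}/ldirectord-$VIP.conf"

cat > "$CONF" <<EOF
# Global settings
checktimeout=20
checkinterval=5
autoreload=yes
quiescent=no
logfile="info"

# Virtual service; the lines below it must be indented
virtual=$VIP:80
    real=127.0.0.1:80 gate 1 ".healthcheck.html", "OKAY"
    real=$RIP:80 gate 1 ".healthcheck.html", "OKAY"
    service=http
    checkport=80
    protocol=tcp
    scheduler=wrr
    checktype=negotiate
    fallback=127.0.0.1:80
EOF

echo "wrote $CONF"
```

Naming the file after the VIP keeps one configuration file per virtual address, which fits the one-daemon-per-VIP arrangement described above.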

Let's examine each line in this configuration file.


 checktimeout

This sets the checktimeout value to the number of seconds ldirectord will wait for a health check (normally some type of network connection) to complete. If the check fails for any reason or does not complete within the checktimeout period, ldirectord will remove the real server from the IPVS table.[16]


 checkinterval

The checkinterval value is the number of seconds ldirectord sleeps between checks.


 autoreload=yes

This enables the autoreload option, which causes the ldirectord program to periodically calculate an md5sum to check this configuration file for changes and automatically apply them when you change the file. This handy feature allows you to easily change your cluster configuration. A few seconds after you save changes to the configuration file, the running ldirectord daemon will see the change and apply the proper ipvsadm commands to implement the change, removing real servers from the pool of available servers or adding them in as needed.[17]


You can also force ldirectord to reload its configuration file by sending the ldirectord daemon a HUP signal (with the kill command) or by running ldirectord reload.


 quiescent=no

A node is "quiesced" (its weight is set to 0) when it fails to respond within its checktimeout period. When you set this option to no, ldirectord will remove the real server from the IPVS table rather than "quiesce" it. Removing the node from the IPVS table breaks all of the existing client connections and causes LVS to drop all connection tracking records and persistent connection templates for the node. If you do not set this option to no, the cluster may appear to be down to some of the client computers when a node crashes, because they were assigned to the node before it crashed and the connection tracking record or persistent connection template still remains on the Director.

When using this option, you may also want to use a command[18] like this at boot time:

 echo 1 > /proc/sys/net/ipv4/vs/expire_nodest_conn

Setting this kernel variable to 1 causes the connection tracking records to expire immediately if a client with a pre-existing connection tracking entry tries to talk to the same server again but the server is no longer available.[19]


All of the sysctl variables, including the expire_nodest_conn variable, are documented at the LVS website.


 logfile="info"

This entry tells ldirectord to use the syslog facility for logging error messages. (See the file /etc/syslog.conf to find out where messages at the "info" level are written.) You can also enter the name of a directory and file to write log messages to (precede the entry with /). If no entry is provided, log messages are written to /var/log/ldirectord.log.


 virtual

This line specifies the VIP address and port number that we want to install on the LVS Director. Recall that this is the IP address you will likely add to the DNS and advertise to client computers. In any case, this is the IP address client computers will use to connect to the cluster resource you are configuring.

You can also specify a Netfilter mark (or fwmark) value on this line instead of an IP address. For example, the following entry is also valid:


This entry indicates that you are using ipchains or iptables to mark packets as they arrive on the LVS Director[20] based on criteria that you specify with these utilities. All packets that contain this mark will be processed according to the rules on the indented lines that follow in this configuration file.


Packets are normally marked to create port affinity (between ports 443 and port 80, for example). See Chapter 14 for a discussion of packet marking and port affinity.

The next group of indented lines specifies which real servers inside the cluster will be able to offer the resource to client computers.

 real= gate 1 ".healthcheck.html", "OKAY"

This line indicates that the Director itself (at its loopback IP address), acting in LocalNode mode, will be able to respond to client requests destined for the VIP address.


Do not use LocalNode in production unless you have thoroughly tested failing over the cluster load-balancing resource. Normally, you will improve the reliability of your cluster if you avoid using LocalNode mode.

 real= gate 1 ".healthcheck.html", "OKAY"

This line adds the first LVS-DR real server using its RIP address.

Each real= line in this configuration file uses the following syntax:

 real=RIP:port gate|masq|ipip [weight] "Request URL", "Response Expected"

This syntax description tells us that the word gate, masq, or ipip must be present on each real line in the configuration file to indicate the type of forwarding method used for the real server. (Recall from Chapter 11 that the Director can use a different LVS forwarding method for each real server inside the cluster.) This configuration file is therefore using a slightly different terminology (based on the switch passed to the ipvsadm command) to indicate the three forwarding methods. See Table 15-1 for clarification.

Table 15-1: Different Terms for the LVS Forwarding Methods

 ldirectord             Switch Used When    Output of             LVS Forwarding
 Configuration Option   Running ipvsadm     ipvsadm -L Command    Method
 gate                   -g                  Route                 LVS-DR
 ipip                   -i                  Tunnel                LVS-TUN
 masq                   -m                  Masq                  LVS-NAT

Following the forwarding method on each real= line is the weight assigned to the real servers in the cluster. This is used only with one of the weighted scheduling methods. The last two parameters on the line indicate which web page or URL ldirectord should ask for when checking the health of the real server, and what response ldirectord should expect to receive from the real server. Both of these parameters need to be in quotes as shown, separated by a comma.
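To make the relationship between a real= line and ipvsadm concrete, here is a small sketch that maps the forwarding keyword to its ipvsadm switch and prints the command that would add the real server to a TCP virtual service. It is an illustration only: fwd_switch and real_to_ipvsadm are hypothetical helper names, the addresses are placeholders from the documentation range, and the parser assumes the optional weight field is present.

```shell
#!/bin/sh
# Map an ldirectord forwarding keyword to the matching ipvsadm switch.
# fwd_switch and real_to_ipvsadm are hypothetical helper names.
fwd_switch() {
    case "$1" in
        gate) echo "-g" ;;   # LVS-DR
        ipip) echo "-i" ;;   # LVS-TUN
        masq) echo "-m" ;;   # LVS-NAT
        *)    echo "unknown forwarding method: $1" >&2; return 1 ;;
    esac
}

# Turn one real= line into the ipvsadm command that would add that server
# to a TCP virtual service given as VIP:port. Prints the command only.
real_to_ipvsadm() {
    vip=$1
    line=${2#real=}            # drop the leading "real="
    set -- $line               # split into: RIP:port method weight ...
    echo "ipvsadm -a -t $vip -r $1 $(fwd_switch "$2") -w $3"
}

real_to_ipvsadm 192.0.2.1:80 'real=192.0.2.11:80 gate 1 ".healthcheck.html", "OKAY"'
```

The request and response strings are ignored here because they matter only to the health check, not to the IPVS table entry.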


 service=http

This line indicates which service ldirectord should use when testing the health of the real server. You must have the proper CPAN Perl module loaded for the type of service you specify here.


 checkport=80

This line indicates that the health check request for the http service should be performed on port 80.


 protocol=tcp

This entry specifies the protocol that will be used by this virtual service. The protocol can be set to tcp, udp, or fwm. If you use fwm to indicate marked (fwmarked) packets, then you must have already used the Netfilter mark number (or fwmark) instead of an IP address on the virtual= line that all of these indented lines are associated with.


 scheduler=wrr

The scheduler line indicates that we want to use the weighted round-robin load-balancing technique (see the previous real lines for the weight assignment given to each real server in the cluster). See Chapter 11 for a description of the scheduling methods LVS supports. The ldirectord program does not check the validity of this entry; ldirectord just passes whatever you enter here on to the ipvsadm command to create the virtual service.


 checktype=negotiate

This option describes which method the ldirectord daemon should use to monitor the real servers for this VIP. checktype can be set to one of the following:


negotiate

  • This method connects to the real server and sends the request string you specify. If the reply string you specify is not received back from the real server within the checktimeout period, the node is considered dead. You can specify the request and reply strings on a per-node basis as we have done in this example (see the discussion of the real lines earlier in this section), or you can specify one request and one reply string that should be used for all of the real servers by adding two new lines to your ldirectord configuration file like this:

     request=".healthcheck.html"
     receive="OKAY"


connect

  • This method simply connects to the real server at the specified checkport (or port specified for the real server) and assumes everything is okay on the real server if a basic TCP/IP connection is accepted by the real server. This method is not as reliable as the negotiate method for detecting problems with the resource daemons (such as HTTP) running on the real server. Connect checks are useful when there is no negotiate check available for the service you want to monitor.

A number

  • If you enter a number here instead of the word negotiate or connect, the ldirectord daemon will perform the connection test the number of times specified, and then perform one negotiate test. This method reduces the processing power required of the real server when responding to health checks and reduces the amount of cluster network traffic required for health check monitoring of the services on the real servers.[21]


off

  • This disables ldirectord's health check monitoring of the real servers.
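The numeric checktype setting described above alternates between the two kinds of test. A minimal sketch of that schedule follows; check_kind is a hypothetical helper, the counter is assumed to start at 1, and ldirectord's internal bookkeeping may differ in detail.

```shell
#!/bin/sh
# Decide which test to run on the i-th health check (i counts from 1)
# when checktype is set to a number N: N connect tests, then one negotiate.
# check_kind is a hypothetical helper; ldirectord's internals may differ.
check_kind() {
    i=$1
    n=$2
    if [ $(( i % (n + 1) )) -eq 0 ]; then
        echo negotiate
    else
        echo connect
    fi
}

# With N=3, every fourth check is a negotiate test.
for i in 1 2 3 4 5 6 7 8; do
    printf '%s ' "$(check_kind "$i" 3)"
done
echo
```

A larger N lowers the load on the real servers at the cost of taking longer to notice a daemon that still accepts connections but returns the wrong content.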


 fallback

The fallback address is the IP address and port number that client connections should be redirected to when there are no real servers in the IPVS table that can satisfy their requests for a service. Normally, this will be the loopback address, in order to force client connections to a daemon running locally that is at least capable of informing users that there is a problem, and perhaps letting them know whom to contact for additional help.

You can also specify a special port number for the fallback web page:


We did not use persistence to create our virtual service table. Persistence is enabled in the ldirectord.conf file using the persistent= entry to specify the number of seconds of persistence. See Chapter 14 for a discussion of LVS persistence.

Create the Health Check Web Page

Now, on each real server and on the Director enter a command like this to create a simple health check web page:

 #echo "OKAY" > /www/htdocs/.healthcheck.html

The directory used here should be whatever is specified as the DocumentRoot in your httpd.conf file. Also note the period (.) before the file name in this example.[22]

Check that the health check web page displays properly on each real server with the following command:

 #lynx -dump
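
The same check can be scripted. The following is a minimal sketch, not ldirectord's own logic: it assumes the page counts as healthy when its body contains the string OKAY written by the echo command above. The function name page_is_healthy is hypothetical, and the page body is read from standard input so that no URL has to be assumed here.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if the page body read
# from standard input contains the expected health check string.
EXPECTED="OKAY"

page_is_healthy() {
    # -q: print nothing, -F: match a fixed string, not a regular expression
    grep -qF "$EXPECTED"
}

# Example with a locally generated body instead of a real HTTP request:
if echo "OKAY" | page_is_healthy; then
    echo "real server is up"
else
    echo "real server is down"
fi
```

On a real server you would pipe the output of the lynx -dump command for your health check page into page_is_healthy instead of using echo.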

Start ldirectord Manually and Test Your Configuration

In a moment we'll use Heartbeat to start our ldirectord monitoring daemon, but first you need to try starting the daemon on the Director to see if your IPVS table is created.

  1. Enter this command:

     #/etc/ha.d/resource.d/ldirectord -d ldirectord- start

    You should see the ldirectord debug output indicating that the ipvsadm commands have been run to create the virtual server followed by commands such as the following:

     DEBUG2: check_http: is down
  2. Now if you bring the real server online at IP address, and it is running the Apache web server, the debug output message from ldirectord should change to:

     DEBUG2: check_http: is up
  3. Check to be sure that the virtual service was added to the IPVS table with the command:

     #ipvsadm -L -n
  4. Once you are satisfied with your ldirectord configuration and see that it is able to communicate properly with your health check URLs, press CTRL-C to break out of the debug display.

  5. Restart ldirectord (in normal mode) with:

     #/usr/sbin/ldirectord ldirectord- start
  6. Then start ipvsadm with the watch command to automatically update the display:

     #watch ipvsadm -L -n

    You should see an ipvsadm table like this:

     IP Virtual Server version x.x.x (size=4096)
     Prot LocalAddress:Port Scheduler Flags
       -> RemoteAddress:Port     Forward    Weight    ActiveConn    InActConn
     TCP wrr
       ->     Route      1         0             0
       ->           Local      1         0             0
  7. Test ldirectord by shutting down Apache on the real server or disconnecting its network cable. Within 20 seconds, or the checktimeout value you specified, the real server's weight should be set to 0 so that no future connections will be sent to it.

  8. Turn the real server back on or restart the Apache daemon on the real server, and you should see its weight return to 1.
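
If you would rather not watch the table by eye during steps 7 and 8, a small filter can report which real servers currently have a weight of 0. This is only a sketch: the function name quiesced_servers is hypothetical, and the addresses in the sample input are made-up documentation addresses, not the ones used in this recipe.

```shell
#!/bin/sh
# Hypothetical helper: print the real servers whose weight is 0 in
# "ipvsadm -L -n" style output read from standard input.
quiesced_servers() {
    # Real server lines start with "->"; the fourth field is the weight.
    awk '$1 == "->" && $2 != "RemoteAddress:Port" && $4 == "0" { print $2 }'
}

# Sample input (addresses are illustrative only):
quiesced_servers <<'EOF'
IP Virtual Server version x.x.x (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port     Forward    Weight    ActiveConn    InActConn
TCP  192.0.2.10:80 wrr
  -> 192.0.2.11:80          Route      0         0             0
  -> 127.0.0.1:80           Local      1         0             0
EOF
```

On the Director itself you would run ipvsadm -L -n | quiesced_servers.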

Add ldirectord to the Heartbeat Configuration

Now it's time to tell Heartbeat about our ldirectord configuration so that it will start it automatically at boot time.

  1. Follow the recipe in Chapter 7 to install the Heartbeat program if you have not done so already; then create the following entry in your /etc/ha.d/haresources file: ldirectord::ldirectord-

    where is the host name (as returned by uname -n) of your primary Director.

    This haresources entry tells Heartbeat that the machine named should normally own the ldirectord resource for VIP The ldirectord resource, or script, is started by Heartbeat, and based on this entry, passed to the configuration file named /etc/ha.d/conf/ldirectord- in this example.

  2. Stop ldirectord with the command:

     #/etc/ha.d/resource.d/ldirectord ldirectord- stop
  3. Make sure ldirectord is no longer running with:

     #ps -elf | grep ldirectord

    This command should not return any output (other than, possibly, the grep process itself) if ldirectord is not running. If you see a running ldirectord process here, be sure to kill it before going on with the next command.

  4. Make sure ldirectord is not started as a normal boot script with the command:

     #chkconfig --del ldirectord
  5. Stop Heartbeat with the command:

     #/etc/rc.d/init.d/heartbeat stop
  6. Clean out the ipvsadm table so that it does not contain any test configuration information with:

     #ipvsadm -C
  7. Now start the heartbeat daemon with:

     #/etc/rc.d/init.d/heartbeat start

    and watch the output of the heartbeat daemon using the following command:

     #tail -f /var/log/messages

For the moment, we do not need to start Heartbeat on the backup Heartbeat server.

The last line of output from Heartbeat should look like this:

 heartbeat: info: Running /etc/ha.d/resource.d/ldirectord ldirectord- start

You should now see the IPVS table constructed again, according to your configuration in the /etc/ha.d/conf/ldirectord- conf file. (Use the ipvsadm -L -n command to view this table.)

To finish this recipe, install the ldirectord program on the backup LVS Director, and then configure the Heartbeat configuration files exactly the same way on both the primary LVS Director and the backup LVS Director. (You can do this manually or use the SystemImager package described in Chapter 5.)

Stateful Failover of the IPVS Table

As we've just seen, when the primary Director crashes and ldirectord needs to rebuild the IPVS table on the backup Director, it can do so because you have placed all of the ipvsadm configuration rules into an ldirectord configuration file. However, the active client connections (the connection tracking records) are not re-created by ldirectord on the backup Director at failover time. With the recipe we have just built, all client connections to the cluster nodes are lost during a failover.

These connections (entries in the connection tracking table) change rapidly on a heavily loaded cluster as client connections come and go, so we need a method of sending connection tracking records from the primary Director to the backup Director as they change. The LVS programmers developed a technique to do this (replicate the connection tracking table to the backup Director) using multicast packets. This technique was originally called the server sync state daemon, and even though it was implemented inside the kernel (the server sync state daemon does not run in userland), the name stuck. To turn on the server sync state daemon inside the kernel, run the following command on the primary Director:

 /sbin/ipvsadm --start-daemon master

Then, on the backup Director, run the command:

 /sbin/ipvsadm --start-daemon backup

The primary and backup Directors must be able to communicate with each other using multicast packets on multicast address for the master server sync state daemon to announce changes to the connection tracking records, and for the backup server sync state daemon to hear about these changes and insert them into its idle connection tracking table. To find out if your cluster nodes support multicast, see the output of the ifconfig command and look for the word MULTICAST. It should be present for each interface that will be using multicast; if it isn't you'll need to recompile your kernel and provide support for multicast (see Chapter 3).[23]
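
Checking for the MULTICAST flag can be automated. The sketch below reads ifconfig output on standard input and lists every interface that lacks the flag; the function name interfaces_without_multicast is hypothetical, and the parsing assumes the usual ifconfig layout in which each interface name starts in the first column.

```shell
#!/bin/sh
# Hypothetical helper: read ifconfig output on standard input and print the
# name of every interface whose flags do not include MULTICAST.
interfaces_without_multicast() {
    awk '
        /^[^ \t]/ {                 # a line starting in column 1 names an interface
            iface = $1
            sub(/:$/, "", iface)    # newer ifconfig versions print "eth0:"
            if (!(iface in seen)) { seen[iface] = 1; order[++n] = iface }
        }
        /MULTICAST/ { ok[iface] = 1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in ok)) print order[i]
        }'
}
```

Run it as ifconfig | interfaces_without_multicast. The loopback interface will normally be listed, which is harmless; what matters is that the interface the Directors use to reach each other is not.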


To stop the sync state daemon (either the master or the backup) you can issue the command ipvsadm --stop-daemon.

Once you have issued these commands on the primary and backup Directors (you'll need to add them to an init script so that the system issues them each time it boots; see Chapter 1), your primary Director can crash, and when ldirectord rebuilds the IPVS table on the backup Director, all active connections[24] to cluster nodes will survive the failure of the primary Director.
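
One possible shape for such an init script fragment is sketched below. It is an illustration only: the ROLE and IPVSADM variables and the sync_daemon function are hypothetical names, and the commands are exactly the ones shown above. Some ipvsadm versions may expect the role after --stop-daemon as well, so check the man page on your system.

```shell
#!/bin/sh
# Sketch of an init-style wrapper for the IPVS server sync state daemon.
# ROLE (assumed variable): "master" on the primary Director,
# "backup" on the backup Director.
ROLE=${ROLE:-master}
# Set IPVSADM="echo /sbin/ipvsadm" to print the commands instead of running them.
IPVSADM=${IPVSADM:-/sbin/ipvsadm}

sync_daemon() {
    case "$1" in
        start) $IPVSADM --start-daemon "$ROLE" ;;
        stop)  $IPVSADM --stop-daemon ;;
        *)     echo "usage: sync_daemon {start|stop}" >&2; return 2 ;;
    esac
}
```

An init script built on this sketch would end with a call such as sync_daemon "$1", with ROLE set differently on each Director.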


The method just described will fail over active connections from the primary Director to the backup Director, but it does not fail back the active connections when the primary Director has been restored to normal operation. To fail over and fail back active IPVS connections you will need (as of this writing) to apply a patch to the kernel. See the LVS HOWTO for details.

Modifications to Allow Failover to a Real Server Inside the Cluster

We have finished building our pair of highly available load balancers (with no single point of failure). However, before we leave the topic of failing over the Director as a resource under Heartbeat's control, let's look at how the Director failover works in conjunction with the LVS LocalNode mode.

Because a Linux Enterprise Cluster will have a relatively small number of incoming requests for cluster services (perhaps only one or two for each employee[25]), dedicating a server to the job of load balancing the cluster may be a waste of resources. We can fully utilize the CPU capacity on the Director by making the Director a node inside the cluster; however, this may reduce the reliability of the cluster and should be avoided, if possible.[26] When we tell one of the nodes inside the cluster to be the Director, we will also need to tell one of the nodes inside the cluster to be a backup Director if the primary Director fails.

Heartbeat running on the backup Director will start the ldirectord daemon, move all of the VIP addresses, and re-create the IPVS table in order to properly route client requests for cluster resources if the primary Director fails, as shown in Figure 15-4.

Figure 15-4: Failure of primary Director

Notice in Figure 15-4 that the backup Director is also replying to client computer requests for cluster services—before the failover the backup Director was just another real server inside the cluster that was running Heartbeat. (In fact, before the failover occurred the primary Director was also replying to its share of the client computer requests that it was processing locally using the LocalNode feature of LVS.)

Although no changes are required on the real servers to inform them of the failover, we will need to tell the backup LVS Director that it should no longer hide the VIP address on its loopback device. Fortunately, Heartbeat does this for us automatically, using the IPaddr2 script[27] to see if there are hidden loopback devices configured on the backup Director with the same IP address as the VIP address being added. If so, Heartbeat will automatically remove the hidden, conflicting IP address on the loopback device and its associated route in the routing table. It will then add the VIP address on one of the real network interface cards so that the LVS-DR real server can become an LVS-DR Director and service incoming requests for cluster services on this VIP address.


Heartbeat will remember that it has done this and will restore the loopback addresses to their original values when the primary server is operational again (the VIP address fails back to the primary Heartbeat server).

[7]This method is also discussed on the Ultra Monkey website.

[8]See Chapter 2 for a discussion of the routing table.

[9]The IPaddr and IPaddr2 resource scripts.

[10]This assumes that the Heartbeat program is started after this script runs.

[11]Use netmask here no matter what netmask you normally use for your network.

[12]They originally came from CPAN.

[13]You can also use Webmin as a web interface or GUI to download and install CPAN modules.

[14]Older versions of ldirectord would only display the help page if you ran the program as a non-root user. If you see a security error message when you run this command, you should upgrade to at least version 1.46.

[15]Making custom changes to get ldirectord to run in a test environment such as the one being described in this chapter is fine, but for production environments you are better off installing the CPAN modules and using the stock version of ldirectord to avoid problems in the future when you need to upgrade.

[16]The checktimeout and checkinterval values for ldirectord have nothing to do with the keepalive, deadtime, and initdead variables used by Heartbeat.

[17]If you simply want to be able to remove real servers from the cluster for maintenance and you do not want to enable this feature, you can disconnect the real server from the network, and the real server will be removed from the IPVS table automatically when you have configured ldirectord properly.

[18]Or set this value in the /etc/sysctl.conf file on Red Hat systems.

[19]The author has not used this kernel setting in production, but found that setting quiescent to "no" provided an adequate node removal mechanism.

[20]The machine running ldirectord will also be the LVS Director. Recall that packet marks only apply to packets while they are on the Linux machine that inserted the mark. The packet mark created by iptables or ipchains is forgotten as soon as the packet is sent out the network interface.

[21]How much it reduces network traffic is based on the type of service and the size of the response packet that the real server must send back to verify that it is operating normally.

[22]This is not required, but is a carryover from the Unix security concept that file names starting with a period (.) are not listed by the ls command (unless the -a option is used) and are therefore hidden from normal users.

[23]You may also be able to specify the interface that the server sync state daemon should use to send or listen for multicast packets using the ipvsadm command (see the ipvsadm man page); however, early reports of this feature indicate that a bug prevents you from explicitly specifying the interface when you issue these commands.

[24]All connections that have not been idle for longer than the connection tracking timeout period on the backup Director (default is three minutes, as specified in the LVS code).

[25]Compare this to the thousands of requests that are load balanced across large LVS web clusters.

[26]It is easier to reboot real servers than the Director. Real servers can be removed from the cluster one at a time without affecting service availability, but rebooting the Director may impact all client computer sessions.

[27]The IPaddr script, if you are using ip aliases.

In Conclusion

An enterprise-class cluster must have no single point of failure. It should also know how to automatically detect and remove a failed cluster node. If a node crashes in a Linux Enterprise Cluster, users need only log back on to resume their session. If the Director crashes, a backup Director will take ownership of the load-balancing resource and VIP address that client computers use to connect to the cluster resources.

Chapter 16: The Network File System


The Network File System (NFS) looks like a single, unified filesystem to users and application programs running on the cluster nodes. A key feature of NFS is its ability to share data with all of the cluster nodes while providing a locking mechanism to protect the integrity of the data. To build a cluster that does not require any changes to existing multiuser application programs, this locking mechanism must support the locking methods used by programs that are unaware of the fact that they are running on a cluster node. That is, NFS and its locking mechanism should be transparent to the existing multiuser applications. This transparency of the shared filesystem is one of the defining characteristics of the Linux Enterprise Cluster—applications see the cluster as a single, unified computing resource.

In this chapter, we'll examine what the term lock means, what locking methods are normally used on multiuser operating systems, and how NFS allows existing legacy, multiuser applications that are running on cluster nodes to share access to data. (If you do not need to run legacy Unix applications on your Linux Enterprise Cluster, you can skip over most of the locking information in this chapter.) I'll also introduce you to a few performance issues you will face when using NFS and provide you with an example of the NFS client configuration options you can use on each node inside the cluster.


In Chapter 20, we'll look at how a database system such as MySQL or Postgres can be used in a cluster environment.

Lock Arbitration

When an application program reads data from stable storage (such as a disk drive) and stores a copy of that data in a program variable or system memory, the copy of the data may very soon be rendered inaccurate. Why? Because a second program may, only microseconds later, change the data stored on the disk drive, leaving two different versions of the data: one on disk and one in the program's memory. Which one is the correct version?

To avoid this problem, well-behaved programs will first ask a lock arbitrator to grant them access to a file or a portion of a file before they touch it. They will then read the data into memory, make their changes, and write the data back to stable storage. Data accuracy and consistency are guaranteed by the fact that all programs agree to use the same lock arbitration method before accessing and modifying the data.

Because the operating system does not require a program to acquire a lock before writing data,[1] this method is sometimes called cooperative, advisory, or discretionary locking. These terms suggest that programs know that they need to acquire a lock before they can manipulate data. (We'll use the term cooperative in this chapter.)

The cooperative locking method used on a monolithic multiuser server should be the same as the one used by applications that store their data using NFS: the application must first acquire a lock on a file or a portion of a file before reading data and modifying it. The only difference is that this lock arbitration method must work on multiple systems that share access to the same data.

[1]Except in the case of mandatory locks.

The Lock Arbitrator

We can divide cooperative lock arbitration into three categories based on who or what is acting as the lock arbitrator:

Kernel lock arbitration

  • The program asks the kernel for a lock on a particular file or a portion of a file. This is perhaps the most common method used by programmers when developing applications such as an order-entry system in a typical multiuser environment such as Linux or Unix.

File lock arbitration

  • The program creates a new file called a lock file or a dotlock file on stable storage to indicate that it would like exclusive access to a data file. Programs that access the same data must look for this file or examine its contents.[2] This method is typically used in order to avoid subtle differences in kernel lock arbitration implementations (in different versions of Unix for example) when programmers want to develop software that will run on a variety of different operating systems.

External lock arbitration daemon

  • The program asks a daemon to track lock and unlock requests. This type of locking is normally used for shared storage. Examples of this type of lock arbitration include sophisticated database applications, distributed lock managers (that can run on more than one node at the same time and share lock information), and the Network Lock Manager (NLM) used in NFSv3, which will be discussed shortly.


Additional lock arbitration methods for cluster file systems are provided in Appendix E.

Our cluster environment should use a cooperative lock arbitration method that works in conjunction with shared storage while still honoring the classic Unix or Linux kernel lock arbitration requests (so we do not have to rewrite all of the user applications that will share data in the cluster). Fortunately, Linux supports a shared storage external lock arbitration daemon called lockd. But before we get into the details of NFS and lockd, let's more closely examine the kernel lock arbitration methods used by existing multiuser applications.

The Existing Kernel Lock Arbitration Methods

The three most common cooperative kernel lock arbitration methods available on Linux are BSD flock, System V lockf, and Posix fcntl.

BSD Flock

This is considered an antiquated method of locking files because it only supports locking an entire file, not a range of bytes (called a byte offset) within a file. There are two types of flocks: shared and exclusive. As their names imply, many processes may hold a shared lock on one file, but only one process may hold an exclusive lock. (A file cannot be locked with both shared and exclusive locks at the same time.) Because this method only supports locking the entire file, your existing multiuser applications are probably not using this locking mechanism to arbitrate access to shared user data. We'll discuss this further in "The Network Lock Manager (NLM)."
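
The difference between shared and exclusive flocks can be seen from the shell. The following sketch assumes the flock(1) utility from util-linux is installed; the try_lock function and the temporary file are hypothetical and exist only for this demonstration. The script holds a shared lock on one file descriptor and then asks for further locks through separate opens of the same file.

```shell
#!/bin/sh
# Demonstrates shared versus exclusive BSD flocks with the flock(1) utility
# (assumed to be installed; it is part of util-linux).
lockfile=$(mktemp)

# Take a shared lock on file descriptor 8 and hold it for the rest of the script.
exec 8<"$lockfile"
flock -s 8

# try_lock MODE: attempt a non-blocking lock through a separate open of the file.
# MODE is -s (shared) or -x (exclusive).
try_lock() {
    if flock -n "$1" "$lockfile" true; then
        echo "granted"
    else
        echo "denied"
    fi
}

echo "second shared lock: $(try_lock -s)"
echo "exclusive lock:     $(try_lock -x)"
```

While the first shared lock is held, the second shared request is granted and the exclusive request is denied. Remove the temporary file when you are finished experimenting.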

System V lockf

The System V locking method is called lockf. On a Linux system, lockf is really just an interface to the Posix fcntl method discussed next.

Posix fcntl

The Posix-compliant fcntl system call used on Linux does a variety of things, but for the moment we are only interested in the fact that it allows a program to lock a byte-range portion of a file.[3] Posix fcntl locking supports the same two types of locks as the BSD flock method but uses the terms read and write instead of shared and exclusive. Multiple read locks are allowed at the same time, but when a program asks the kernel for a write lock, no other programs may hold either type of lock (read or write) for the same range of bytes within the file.

When using Posix fcntl, the programmer decides what will happen when a lock is denied or blocked by the kernel by deciding if the program should wait for the call to fcntl to succeed (wait for the lock to be granted). If the choice is to wait, the fcntl call does not return and the kernel places the request[4] into a blocked queue. When the program that was holding the lock releases it and the reason for the blocked lock no longer exists, the kernel will reply to the fcntl request, waking up the program from a "sleep" state with the good news that its request for a lock has been granted.

For example, three processes may all hold a Posix fcntl read lock on the byte range 1,000–1,050 of the same file at the same time. But when a fourth process comes along and wants to acquire a Posix fcntl write lock to the same range of bytes, it will be placed into the blocked queue. As long as any one of the original three programs continues to hold its read lock, the fourth program's request for a write lock will remain in the blocked queue. However, if a fifth program asks for a Posix fcntl read lock, it will be granted it immediately. Now the fourth program is still waiting for its write lock and it may have to wait forever if new processes keep coming along and acquiring new read locks. This complexity makes it more difficult to write a program that uses both Posix fcntl read and Posix fcntl write locks.

Also of note, when considering Posix fcntl locks:

  1. If a process holds a Posix fcntl lock and forks a child process, the child process (running under a new process id or pid) will not inherit the lock from its parent.

  2. A program that holds a lock may ask the kernel for this same lock a second time and the kernel will "grant" it again without considering this a conflict. (More about this in a moment when we discuss the Network Lock Manager and lockd.)

  3. The Linux kernel currently does not check for collisions with Posix fcntl byte-range locks and BSD file flocks. The two locking methods do not know about each other.


There are at least three additional kernel lock arbitration methods available under Linux: whole file leases, share modes (similar to Windows share modes), and mandatory locks. If your application relies on one of these methods for lock arbitration, you must use NFS version 4.

Now that we've talked about the existing kernel lock arbitration methods that work on a single, monolithic server, let's examine a locking method that allows more than one server to share locking information: the Network Lock Manager.

[2]The PID of the process that created the dotlock file is usually placed into the file.

[3]If the byte range covers the entire file, then this is equivalent to a whole file lock.

[4]Identified by the process id of the calling program as well as the file and byte range to be locked.

The Network Lock Manager (NLM)

NFS was originally conceived and built to manage distributed access to a single data storage device, with separate programs added to provide a cooperative lock arbitration method. Because the filesystem and its locking method were implemented separately, the original NFS developers felt that they were providing a generic means of network lock arbitration that could be used in conjunction with any network file system. However, the Network Lock Manager (NLM), as it became known, has only been widely used as a lock arbitration method by NFSv3 servers and clients. The NFSv4 protocol does not use a separate daemon or protocol for locking, but many of the concepts are the same, so we'll introduce NFS locking by discussing the NFSv3 NLM.

The NLM consists of two daemons called statd and lockd. Both daemons are supposed to be running all the time on the NFS server and on all NFS clients to ensure that everyone agrees upon what is locked and which programs or processes own the locks. The statd daemon running on both the client and the server keeps a list of the hosts that are currently holding or granting locks (NFS clients, if the machine is acting as an NFS server, and NFS servers, if the machine is an NFS client[5]). The lockd daemon on the NFS client is in charge of making a lock request over the network by talking to the lockd daemon running on the NFS server.

Let's examine these two daemons more closely.


statd

  • In a cluster, statd, also called rpc.statd, runs on each cluster node in case the node crashes while it holds a lock on the NFS server. If this happens, the rpc.statd program on the cluster node will notify the NFS server when the cluster node is rebooted that it is now operational again. statd on the NFS client knows that it should do this, because it writes the name of each NFS server to a local disk drive the first time a process running on the cluster node tries to lock a file on the NFS server. When the NFS server receives this notification after an NFS client reboots, it assumes all of the processes that were holding locks from this cluster node (or NFS client) are no longer running, and it releases them.


statd or rpc.statd is sometimes described as an implementation of a network status monitor (NSM).


lockd

  • The Linux kernel knows when a Posix fcntl lock request is for a file or portion of a file stored on an NFS server, because the file is stored on an NFS-mounted filesystem. If the Linux machine is an NFS client, the kernel knows it needs to pass the lock request to the locally running lockd daemon.[6] The lockd daemon on the NFS client then contacts the lockd daemon running on the NFS server and requests the lock. The lockd daemon on the NFS server then asks its own kernel for the lock on behalf of the NFS client.

    If the same program or process asks for the same lock[7] more than once, it will be granted the lock each time; the kernel doesn't consider this to be a violation of the lock. Before lockd became a part of the kernel, the ability for a single process to be granted the same lock more than once meant that the lockd daemon had to track lock requests that it had made on behalf of the NFS clients and deny any conflicts before asking the kernel for the lock. Were it not to do so, lockd could unknowingly ask for, and be granted, the same lock twice on behalf of two different NFS clients.

[5]Actually, it always keeps a list of both if the machine is both an NFS client and an NFS server.

[6]The lockd daemon actually runs inside the kernel as a kernel thread on Linux and is called klockd, but for our purposes, we'll talk about it as if it were outside the kernel.

[7]At least this is true for Posix fcntl locks in NFS version 3.

NLM and Kernel Lock Arbitration

Let's look at how NLM works in conjunction with the kernel lock arbitration methods on the NFS clients (the cluster nodes). Remember that we want to support existing multiuser applications that use one or more of the existing kernel lock arbitration methods so that we do not have to rewrite user applications to run them on the cluster.

NLM and Kernel BSD Flock

The Linux kernel currently does not pass BSD flock requests for whole file locks to the NLM. As such, this method of file locking will not work on the Linux Enterprise Cluster when access to shared data is required across all cluster nodes.

Because BSD flocks can only lock whole files, your existing multiuser applications aren't likely to use them to share user data. BSD flocks are more commonly used by applications that fork child processes in order to prevent the child processes from doing things that would cause conflicts with each other. For example, the LPRng printing system creates child processes for sending print jobs, and the child processes create temporary control files in the /var/spool/lpd directory to ensure that print jobs are sent in the correct order.

In fact, most daemons that use BSD flocks create temporary files in a subdirectory underneath the /var directory, so when you build a Linux Enterprise Cluster you should not use a /var directory over NFS. By using a local /var directory on each cluster node, you can continue to run existing applications that use BSD flocks for temporary files.[8]


In the Linux Enterprise Cluster, NFS is used to share user data, not operating system files.

NLM and Kernel System V lockf

The System V lockf method is a wrapper around the Posix fcntl method of lock arbitration. See the next section for details.

NLM and Kernel Posix fcntl

When an application running on an NFS client issues an fcntl lock operation on data that is mounted over an NFS filesystem, several things happen:

  1. Any file data or file attribute information stored in the local NFS client computer's cache is flushed.

  2. The NFS server is contacted[9] to check the attributes of the file again, in part to make sure that the permissions and access mode settings on the file still allow the process to gain access to the file.

  3. lockd on the NFS server is contacted to determine if the lock will be granted.

  4. If this is the first lock made by the particular NFS client, the statd daemon on both the NFS client and the NFS server records the fact that this client has made a lock request.

Posix fcntl locks that use the NLM (for files stored using NFS) are very slow when compared to the ones granted by the kernel. Also notice that due to the way the NFS client cache works, the only way to be sure that the data inside your application program's memory is the same as the data stored on the NFS server is to lock the data before reading it. If your application reads the data without locking it, another program can modify the data stored on the NFS server and render your copy (which is stored in your program's memory) inaccurate.

We'll discuss the NLM performance issues more shortly.

[8]When you use the LPRng printing system, printer arbitration is done on a single print spool server (called the cluster node manager in a Linux Enterprise Cluster), so lock arbitration between child LPRng printing processes running on the cluster nodes is not needed. We'll discuss LPRng printing in more detail in Chapter 19.

[9]Using the GETATTR call.

NFS and File Lock (dotlock) Arbitration

Recall from our discussion earlier in this chapter that applications that use file lock arbitration (also called dotlocking) check for a dotlock file[10] as a means of arbitrating access to a data file. Because NFS clients are normally configured to use locally cached data that may not be the same as the data stored on the NFS server, dotlock file arbitration must be implemented carefully if it is to work over NFS.

In order to use dotlock file arbitration, the programmer must use a method of creating dotlock files that ensures that the dotlock file has been stored on the NFS server and is not just sitting in the local NFS client's cache. This is accomplished (as described on the "open" man page) by creating the dotlock file, linking to it, and then checking to make sure that the link command worked or that the count of the number of links associated with the dotlock file has increased to 2.
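
The create-then-link technique can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the code used by any of the programs discussed here: the function names acquire_dotlock and release_dotlock are hypothetical, and the unique file name is built from the host name and process ID. The lock file ends up containing the PID of the owner.

```shell
#!/bin/sh
# Sketch of NFS-safe dotlock creation using a hard link, following the
# approach described in the open(2) man page. Function names are hypothetical.

# acquire_dotlock LOCKFILE: exit status 0 means the caller now owns the lock.
acquire_dotlock() {
    lock=$1
    # Unique file in the same directory (and so the same filesystem) as the lock.
    tmp="$lock.$(uname -n).$$"
    echo "$$" > "$tmp" || return 1

    # ln without -f refuses to replace an existing lock file.
    ln "$tmp" "$lock" 2>/dev/null

    # Do not rely on the exit status of ln alone: confirm that the link count
    # of the unique file is now 2 (second column of "ls -ld").
    links=$(ls -ld "$tmp" | awk '{ print $2 }')
    rm -f "$tmp"
    [ "$links" -eq 2 ]
}

# release_dotlock LOCKFILE: give the lock up again.
release_dotlock() {
    rm -f "$1"
}
```

A caller would wrap its work between the two functions, for example: acquire_dotlock /path/to/data.lock && { update the data; release_dotlock /path/to/data.lock; }, where the path is whatever lock file name your application has agreed upon.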

Apache, sendmail, and LPRng all use dotlock file lock arbitration. However, none of these use dotlock arbitration to share writable data in a normal Linux Enterprise Cluster configuration.


A CGI program running under Apache on a cluster node may share access to writable data with a CGI program running on another cluster node, but the CGI programs can implement their own locking method independent of the dotlock method used by Apache and thus avoid the complexity of implementing dotlock file locking over NFS. For an example, search CPAN[11] for "File::NFSLock."

[10]Or they check the contents of the dotlock file to see if a conflicting PID already owns the dotlock file.

[11]See Chapter 15 for how to search and download CPAN modules.

Finding the Locks Held by the Linux Kernel

We will now leave the theoretical discussion of the NLM and its interaction with the kernel lock arbitration methods and discuss system administration and maintenance activities as they relate to kernel and NLM lock arbitration.

To see what Posix fcntl and BSD flocks are currently granted by the kernel, examine the contents of the /proc/locks file. Although you can do this on any Linux system, this technique is perhaps most useful for troubleshooting locking problems on cluster nodes that have NFS-mounted data.

To view the current locks, first cat the /proc/locks kernel file:

 #cat /proc/locks
 1  2      3         4     5     6  7  8       9 10  11       12       13       14       15
 1: POSIX  ADVISORY  WRITE 14555 21:06:69890   0 EOF cd4f87f0 c02d9fc8 cd4f8570 00000000 cd4f87fc
 2: POSIX  ADVISORY  WRITE 14005 21:06:69889   0 EOF cd4f87f0 cd4f873c cd4f8ca0 00000000 cd4f87fc
 3: POSIX  ADVISORY  WRITE 8508  21:07:56232   0 EOF cd4f8f7c cd4f873c cd4f8908 00000000 cd4f8f88
 4: FLOCK  ADVISORY  WRITE 12435 22:01:2387398 0 EOF cd4f8904 cd4f8f80 cd4f8ca0 00000000 cd4f8910
 5: FLOCK  ADVISORY  WRITE 12362 22:01:1831448 0 EOF cd4f8c9c cd4f8908 cd4f83a4 00000000 cd4f8ca8

I've added column numbers to this output to help make the following explanation easier to understand.

Column 1 displays the lock number and column 2 displays the locking mechanism (either flock or Posix fcntl). Column 3 contains the word advisory or mandatory; column 4 is read or write; and column 5 is the PID (process ID) of the process that owns the lock. Columns 6 and 7 are the device major and minor number of the device that holds the lock,[12] while column 8 is the inode number of the file that is locked. Columns 9 and 10 are the starting and ending byte range of the lock, and the remaining columns are internal block addresses used by the kernel to identify the lock.
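If you need to process this file from a script, the first ten columns can be pulled apart as in the sketch below. The function and key names are assumptions made for this example, and the code handles only lines laid out like the sample above. Note that the major and minor numbers are printed in hexadecimal, which is why 21:06 here corresponds to the device that lsof reports as 33,6.

```python
def parse_lock_line(line):
    """Split one /proc/locks line into the columns described above."""
    fields = line.split()
    major, minor, inode = fields[5].split(":")
    return {
        "number": int(fields[0].rstrip(":")),  # column 1
        "mechanism": fields[1],                # column 2: POSIX or FLOCK
        "mode": fields[2],                     # column 3: ADVISORY or MANDATORY
        "access": fields[3],                   # column 4: READ or WRITE
        "pid": int(fields[4]),                 # column 5
        "major": int(major, 16),               # column 6 (hexadecimal)
        "minor": int(minor, 16),               # column 7 (hexadecimal)
        "inode": int(inode),                   # column 8
        "start": int(fields[6]),               # column 9
        "end": fields[7],                      # column 10: a byte offset or EOF
    }


def read_locks(path="/proc/locks"):
    """Return a parsed entry for every lock the kernel currently reports."""
    with open(path) as locks:
        return [parse_lock_line(line) for line in locks if line.strip()]
```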

To learn more about a particular lock, use lsof to examine the process that holds it. For example, consider this entry from above:

 1: POSIX ADVISORY WRITE 14555 21:06:69890 0 EOF cd4f87f0 c02d9fc8 cd4f8570 00000000 cd4f87fc

The PID is 14555, and the inode number is 69890. Using lsof, we can examine this PID further to see which process and file it represents:

 #lsof -p 14555

The output of this command looks like this:

 lockdemo  14555  root   cwd  DIR    33,6      2048    32799   /root
 lockdemo  14555  root   rtd  DIR    33,6      2048    2       /
 lockdemo  14555  root   txt  REG    33,6      14644   69909   /tmp/lockdemo
 lockdemo  14555  root   mem  REG    33,6      89547   22808   /lib/
 lockdemo  14555  root   mem  REG    33,6      1401027 26741   /lib/i686/
 lockdemo  14555  root   0u   CHR    136,3             5       /dev/pts/3
 lockdemo  14555  root   1u   CHR    136,3             5       /dev/pts/3
 lockdemo  14555  root   2u   CHR    136,3             5       /dev/pts/3
 lockdemo  14555  root   3u   REG    33,6      1024    69890   /tmp/mydata

The last line in this lsof report tells us that the inode 69890 is called /tmp/mydata and that the command or program that is locking this file is called lockdemo.[13]

To confirm that this process does, in fact, have the file open, enter:

 #fuser /tmp/mydata

The output gives us the file name followed by any PIDs that have the file open:

 /tmp/mydata: 14555

To learn more about the locking methods used in NFSv3, see RFC 1813, and for more information about the locking methods used in NFSv4, see RFC 3530. (Also see the Connectathon lock test program on the "Connectathon Test Suites" link at

[12]You can see the device major and minor number of a filesystem by examining the device file associated with the device (ls -l /dev/hda2, for example).

[13]The source code for this program is currently available online at

Performance Issues with NFS—Bottlenecks and Perceptions

Now that we've examined the locking issues required to build a cluster filesystem that can support existing multiuser applications, we need to examine how well the NLM and NFS perform. The two basic areas of concern when analyzing NFS performance are:

  1. Where are the performance bottlenecks?

  2. How will the end-user perceive the overall NFS performance?

Like an engineer building a suspension bridge, a cluster architect wants to know how NFS will bear the load of the traffic that it will carry. As such, we need to analyze the amount of traffic and know where the stress points or bottlenecks are likely to occur. We also need to know the difference between the raw numbers used to describe the speed of an NAS device and the end-user's perception of how well the system as a whole performs.

We'll first look at the size of the pipes used to send and receive data over the network. As shown in Figure 16-1, the size of the network "pipe" connecting the NAS server to the Ethernet backbone will normally support a gigabit per second (Gbps). Also, as shown in Figure 16-1, the cluster node has two network connections. The first is a 100 Mbps pipe to the network that connects the cluster node's real IP address (RIP) to the Director's IP (DIP), called the D/RIP network in LVS terms. The second network connection is used to connect the cluster node to a network that is dedicated to NFS traffic (labeled the "NFS Network" in Figure 16-1).

Figure 16-1: Ethernet performance bottlenecks

The Gigabit pipe on the NAS server and the switched Ethernet backbone (which can be hundreds of Gbps) for the NFS network are not likely to be the first stress points of performance. In fact, most NAS vendors support multiple Gigabit Ethernet pipes to the Ethernet backbone. Cluster nodes can also be built with multiple Gigabit pipes to the Ethernet backbone, though most multiuser applications running in a cluster environment will not saturate a single 100 Mbps pipe. (Remember that one of the benefits of using a cluster is that more nodes can always be added if too many instances of the same application or service on a single node are competing for access to the 100 Mbps pipe to the Ethernet backbone.)[14]

Where, then, is the likely stress point in a shared filesystem? To answer this question, we turn our attention to the end-user's perception of the overall NFS performance.


In the following discussion, the term database refers to flat files or an indexed sequential access method (ISAM) database used by legacy applications. For a discussion of relational and object databases (MySQL, Postgres, ZODB), see Chapter 20.

Single Transactions and User Perception of NFS Performance

For example, consider users who enter orders into a system and who need to run reports against this data. Here's what happens: A customer calls, and a customer service agent pulls up the customer's record. The agent then starts a new order and types in information. Up to this point, the difference between clustered NFS disk performance at 100 Mbps and locally attached storage is negligible.

The agent searches for a desired item (product, airline seat, classified ad, and so forth). If the application can find the data the agent is looking for using a key such as a product number or a flight number, the performance hit created by the NFS overhead is negligible because the application does not need to bring a significant amount of information over the network to complete the search.

Next, the agent selects the item to modify. Now, a Posix fcntl lock operation ensures that another agent cannot select the same item or product. As with the above, the NFS locking overhead on a single byte range within a file is not likely to create a noticeable performance difference as far as the agent is concerned.

Once the agent has picked his or her item, the database quantity is updated to reflect the change in inventory, the lock is released, and perhaps a new record is added to another database for the line item in the order. This process is repeated until the order is complete and a final calculation is performed on the order (total amount, tax, and so on). Again, this calculation is likely to be faster on a cluster node than it would be on a monolithic server because more CPU time is available to make the calculation.

Given this scenario, or one like it, there is little chance that the extra overhead imposed by NFS will impact the customer service agent's perception of the cluster's performance.[15] But what about users running reports?

Multiple Transactions and User Perception of NFS Performance

The situation is different for the user running reports or multiple transactions in a batch. This user's perception of the cluster's performance is likely to depend upon how quickly the NAS server can perform I/O operations and how many lock and GETATTR operations are required for each read or write transaction.


Historically, one drawback to using NAS servers (and perhaps a contributing factor for the emergence of storage area networks) was the CPU overhead associated with packetizing and de-packetizing filesystem operations so they could be transmitted over the network. However, as inexpensive CPUs push past the multi-GHz speed barrier, the packetizing and de-packetizing performance overhead fades in importance. And, in a cluster environment, you can simply add more nodes to increase available CPU cycles if more are needed to packetize and de-packetize Network File System operations (of course, this does not mean that adding additional nodes will increase the speed of NFS by itself).

The speed of your NAS server (the number of I/O operations per second) will probably be determined by your budget. Discussing all of the issues involved with selecting the best NAS hardware is outside the scope of this book, but one of the key performance numbers you'll want to use when making your decision is how many NFS operations the NAS server can perform per second. The NAS server performance is likely to become the most expensive performance bottleneck to fix in your cluster.[16]

The second performance stress point—the quantity of lock and GETATTR operations—can be managed by careful system administration and good programming practices.

[14]The cost to add additional nodes may be less than the cost to build multiple Gigabit pipes to each cluster node.

[15]Unless, of course, the NAS server is severely overloaded. See Chapter 18.

[16]At the time of this writing, a top-of-the-line NAS server can perform a little over 30,000 I/O operations per second.

Managing Lock and GETATTR Operations in a Cluster Environment

You've seen lock operations; GETATTR operations are simply file operations that examine the attributes of a file (who owns it, what time it was last modified, and so forth[17]). Lock and GETATTR operations require the NFS client to communicate with the NAS server using the get attribute or GETATTR call of the NFS protocol. This brings us to the topic of attribute caching, a method you can use to improve the performance of GETATTR calls.

[17]All inode information about the file.

Managing Attribute Caching

The file metadata information that NFS clients read using the NFS GETATTR call can be read from a cached copy of this information stored on the NFS client instead of from the NFS server.


Some applications such as Oracle's 9iRAC technology require you to disable attribute caching (by mounting with the noac option discussed next), so attribute caching may not be available to improve NFS client performance under all circumstances.

To see the difference in performance between GETATTR operations that use the NFS server and GETATTR calls that use cached attribute information, mount the NFS filesystem with the noac option on the NFS client and then run a typical user application with the following command (also on the NFS client):

 #watch nfsstat -c

Watch the value for "getattr" on this report increase as the application performs GETATTR operations to the NFS server. Now remove the noac option, remount the filesystem, and run the same application again. If the increase in the getattr number is significantly lower this time, you should leave attribute caching turned on.
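Rather than reading the numbers off the screen, you could save the nfsstat -c output before and after each run and compare the getattr counters with a short script. The sketch below assumes the report is laid out like the nfsstat samples in this chapter (a line of operation names followed by a line of count and percentage pairs); the function names are hypothetical.

```python
def v3_counters(nfsstat_output):
    """Return {operation: count} for the 'Client nfs v3:' section."""
    counters = {}
    in_v3 = False
    names = None
    for line in nfsstat_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Client nfs v"):
            in_v3 = stripped.startswith("Client nfs v3")
            names = None
            continue
        tokens = stripped.split()
        if not in_v3 or not tokens:
            continue
        if tokens[0].isdigit():
            # Value line: count, percent, count, percent, ...
            if names:
                counters.update(zip(names, (int(t) for t in tokens[0::2])))
            names = None
        else:
            names = tokens  # line of operation names
    return counters


def getattr_delta(before_output, after_output):
    """Number of GETATTR calls made between two nfsstat -c reports."""
    before = v3_counters(before_output).get("getattr", 0)
    after = v3_counters(after_output).get("getattr", 0)
    return after - before
```

Run it once with noac and once without; the smaller delta tells you how many GETATTR calls the attribute cache absorbed.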


Most applications with I/O-intensive workloads will perform approximately 40 percent more slowly when attribute caching is turned off.

The performance of the attribute caching is also affected by the timeout values you specify for the attribute cache. See the options acregmin, acregmax, acdirmin, acdirmax, and actimeo on the nfs man page.


Even with attribute caching turned on, the NFS client (cluster node) will remove attribute information from its cache each time the file is closed. This is called close-to-open or cto cache consistency. See the nocto option on the nfs man page.

Managing Interactive User Applications and Batch Jobs in a Cluster Environment

In the previous discussion, you saw two types of users: interactive users who perform many small I/O operations from the keyboard, and users who run batch jobs and database reports.[18] Each type of user places different demands on the cluster node CPUs and on the NAS server. The first type wants to use the CPU and the NAS server briefly and intermittently during normal business hours, while the second wants to grab as many CPU cycles and NAS I/O operations as possible for long periods of time. To satisfy both types, the cluster administrator can isolate the two types of user applications on different cluster nodes.


Historically, on a monolithic server, the Unix system administrator reduced the impact of batch jobs (CPU- and I/O-hungry programs) by running them at night or by starting them with the nice command. The nice command allows the system administrator to lower the priority of batch programs and database reports relative to other processes running on the system by reducing the amount of the CPU time slice they can grab. On a Linux Enterprise Cluster, the system administrator can more effectively accomplish this by allocating a portion of the cluster nodes to users running reports and allocating another portion of the cluster to interactive users (because this adds to the complexity of the cluster, it should be avoided if possible).

Run Batch Jobs Outside the Cluster

The NAS server, however, is still a single point of contention for these two types of applications, so you'll want to purchase the fastest NAS server you can afford. Another way to reduce the contention for the NAS server, if your budget doesn't allow you to purchase one that is fast enough to handle both types of users at the same time, is to use the rsync utility to make a nightly copy of the data stored on the NAS server and place this copy of the data onto local storage on an old server. You can then run batch jobs that do not require up-to-the-minute data on this server (month-end accounting reports can use the rsync snapshot of yesterday's data, for example).

Use Multiple NAS Servers

Of course, you can also eliminate the NAS server as a single point of contention by using more than one NAS server and spreading your data across two or more NAS servers. For example, the central data center for a multi-warehouse facility could store the data for each warehouse on a separate NAS server.

However, if your budget allows, the easiest configuration for the system administrator to build and maintain is the one that lets users store all of their data in one place: on a single NAS server. And how fast does the NAS server need to be? In the next two sections, we will look at a few of the methods you can use to evaluate a NAS server when it will be used to support the cluster nodes running legacy applications.

[18]Possibly batch jobs that run outside of the normal business hours.

Measuring NFS Latency

To find out how long your application program is idle and waiting for the network and the NAS server, you can run a simulation[19] or a batch program preceded by the time command. For example, let's say you have a batch program called mytest in the /usr/local/bin directory that will perform read and write operations using data stored on the NAS server.[20] To test its performance, start it from a shell prompt with the command:

 #time /usr/local/bin/mytest

When the mytest program finishes executing you will receive a report such as the following:

 real   0m7.282s
 user   0m1.080s
 sys    0m0.060s

This report indicates the amount of wall clock time (real) the mytest program took to complete, followed by the amount of CPU user and CPU system time it used. Add the CPU user and CPU system time together, and then subtract them from the first number, the real time, to arrive at a rough estimate for the latency[21] of the network and the NAS server.
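For the report shown above, the arithmetic is 7.282 - (1.080 + 0.060), or roughly 6.14 seconds spent waiting on the network and the NAS server. A small sketch of the same calculation, assuming the time report uses the "0m7.282s" format shown here (the function names are made up for this example):

```python
def parse_time_value(text):
    """Convert a value such as '0m7.282s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def estimate_latency(report):
    """Return real time minus (user + sys) time, in seconds."""
    values = {}
    for line in report.splitlines():
        parts = line.split()
        if len(parts) == 2:
            values[parts[0]] = parse_time_value(parts[1])
    return values["real"] - (values["user"] + values["sys"])
```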


This method will only work for programs (batch jobs) that perform the same or similar filesystem operations throughout the length of their run (for example, a database update that requires updating the same field in all of the database records that is performed inside of a loop). The method I've just described is useful for arriving at the average I/O requirements of the entire job, not the peak I/O requirements of the job at a given point in time.

You can use this measurement as one of the criteria for evaluating NAS servers, and also use it to study the feasibility of using a cluster to replace your monolithic Unix server. Keep in mind, however, that in a cluster, the speed and availability of the CPU may be much greater than on the monolithic server.[22] Thus, even if the latency of the NAS server is greater than locally attached storage, the cluster may still outperform the monolithic box.

[19]The Expect programming language is useful for simulating keyboard input for an application.

[20]You can use a Linux box as an NFS server for testing purposes. Alternatively, if your existing data resides on a monolithic Unix server, you can simply use your monolithic server as an NFS server for testing purposes. See the discussion of the async option later in this chapter.

[21]The time penalty for doing filesystem operations over the network to a shared storage device.

[22]Because processor speeds increase each year, I'm making a big assumption here that your monolithic server is a capital expenditure that your organization keeps in production for several years—by the end of its life cycle, a single processor on the monolithic server is likely to be much slower than the processor on a new and inexpensive cluster node.

Measuring Total I/O Operations

The second measurement you'll need to make before you shop for a NAS solution is the total number of I/O operations per second that your applications will require at peak processing times. To find this number, use a Linux machine that is not doing anything else (preferably the hardware you will use for your cluster nodes) and run the nfsstat -c command. Sample output for this command looks like this:

 Client rpc stats:
 calls      retrans    authrefrsh
 667813156   38          0
 Client nfs v2:
 null       getattr    setattr     root       lookup     readlink
 0       0% 0       0% 0        0% 0       0% 0       0% 0       0%
 read       wrcache    write       create     remove     rename
 0       0% 0       0% 0        0% 0       0% 0       0% 0       0%
 link       symlink    mkdir       rmdir      readdir    fsstat
 0       0% 0       0% 0        0% 0       0% 0       0% 0       0%
 Client nfs v3:
 null       getattr    setattr     lookup     access     readlink
 0       0% 129765080  0% 523295   0% 10521226  1% 86507757  0% 46448  0%
 read       write      create      mkdir      symlink    mknod
 314127903  2% 23687043  4% 748925   0% 17      0% 225     0% 0       0%
 remove     rmdir      rename      link       readdir    readdirplus
 372452  0% 17      0% 390726  0%  4155    0% 563968  0% 0       0%
 fsstat     fsinfo     pathconf    commit
 2       0% 2       0% 0       0%  12997359  2%

Assuming you have configured this machine to use NFS version 3 (a sample /etc/fstab entry to do this is provided later in this chapter), you can estimate your I/O requirements as follows:

  1. Add up all of the values listed under the Client nfs v3: section of this report.

  2. Run your sample application using the time command I just described.

  3. When the application finishes running, use the nfsstat -c command again, and add up the total number of NFS calls that were made.

  4. Subtract the first total from the second to arrive at the total number of NFS operations your application performed.

  5. Divide this number by the total number of elapsed seconds your application ran (the first number returned by the time command) to determine the average NFS I/O operations per second your application uses.

  6. Multiply this number by the total number of applications that will be running at peak system load (an order processing deadline, for example) to arrive at a very rough estimate of the total number of I/O operations per second you'll need on your NAS server.

(For a more accurate picture of the total number of I/O operations per second you will require, perform the same analysis using a batch job, and add this number to your total as well.)
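The bookkeeping just described is easy to get wrong by hand, so here is a sketch of it. The code assumes the nfsstat -c report is laid out like the sample above, with counts as bare integers and percentages ending in "%"; the function names are hypothetical.

```python
def total_v3_calls(nfsstat_output):
    """Add up every counter in the 'Client nfs v3:' section."""
    total = 0
    in_v3 = False
    for line in nfsstat_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Client nfs v"):
            in_v3 = stripped.startswith("Client nfs v3")
            continue
        if in_v3:
            # Percentages ("2%") and operation names are not pure digits.
            total += sum(int(tok) for tok in stripped.split() if tok.isdigit())
    return total


def ops_per_second(total_before, total_after, elapsed_seconds):
    """Average NFS operations per second for one run of the application."""
    return (total_after - total_before) / elapsed_seconds


def peak_estimate(per_application_ops, concurrent_applications):
    """Very rough total operations per second needed at peak load."""
    return per_application_ops * concurrent_applications
```

For example, an application that adds 700 NFS calls to the counters during a 7-second run averages 100 operations per second; 25 copies of it running at once would call for roughly 2,500 operations per second from the NAS server.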

Now let's examine a few of the options you have for configuring your NFS environment to achieve the best possible performance and reliability.

Achieving the Best NAS Performance Possible

Now that you know a few of the considerations you'll need to make before moving your legacy applications on to a cluster filesystem that uses the NFS protocol, note a few of the things you can do to make your cluster filesystem perform optimally:

  • Dedicate a network to isolate NFS traffic.

  • Use a high-quality NAS server from a NAS vendor (these NAS servers use nonvolatile RAM to commit NFS write operations as quickly as possible).

  • Experiment with the read and write sizes used on the NFS client (see "wsize" and "rsize" on the mount man page). When using NFS over TCP (discussed in a moment), the recommended read and write size is 32K. (The read and write size should always be larger than the NFS client's page size, which is usually 4K.) Sending larger read and write requests from an NFS client can significantly reduce the latency of NFS operations; however, you can't always control the size of NFS packets by simply changing these numbers (the application program that uses NFS may also need to be modified).

  • Use specialized networking techniques from your NAS vendor (such as trunking multiple network connections) to remove network performance bottlenecks.[23]


In an effort to boost the NFS server's performance, the current default Linux NFS server configuration uses asynchronous NFS. This lets a Linux NFS server (acting as an NAS device) commit an NFS write operation without actually placing the data on the disk drive. This means if the Linux NFS server crashes while under heavy load, severe data corruption is possible. For the sake of data integrity, then, you should not use async NFS in production. An inexpensive Linux box using asynchronous NFS will, however, give you some feel for how well an NAS system performs.[24]

Don't assume you need the added expense of a GigE network for all of your cluster nodes until you have tested your application using a high-quality NAS server.

In Chapter 18, we'll return to the topic of NFS performance when we discuss how to use the Ganglia monitoring package.

[23]Although, as we've already discussed, the network is not likely to become a bottleneck for most applications.

[24]Though async NFS on a Linux server may even outperform a NAS device.

NFS Client Configuration Options

Now let's examine the NFS mount options you can use on the cluster nodes.


The Linux NFS client support for TCP helps to improve NFS performance when network load would otherwise force an NFS client using UDP to resend its network packets. Because UDP performance is better when network load is light, and because NFS is supposed to be a connectionless protocol, it is tempting to use UDP, but TCP is more efficient when you need it most—when the system is under heavy load. To specify tcp on the NFS client, use mount option tcp.


To make sure you are using NFS version 3 and not version 2, you should also specify this in your mount options on the NFS clients. This is mount option vers=3.


To continue to retry the NFS operation and cause the system to not return an error to the user application performing the I/O, use mount option hard.


To prevent the system from booting when it cannot mount the filesystem, use mount option fg.

Putting it all Together

The NFS mount options just described can be used in an /etc/fstab entry that looks like this:

     nasserver:/clusterdata    /clusterdata   nfs rw,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,fg 0 0

This single line says to mount the filesystem called /clusterdata from the NAS server host named nasserver on the mount point called /clusterdata. The fields and options that follow on this line specify: the type of filesystem is NFS (nfs), the type of access to the data is both read and write access (rw), the cluster node should retry failed NFS operations indefinitely (hard), the NFS operations cannot be interrupted (nointr), all NFS calls should use the TCP protocol instead of UDP (tcp), NFS version 3 should always be used (vers=3), the read (rsize) and write (wsize) size of NFS operations are 32K to improve performance, the system will not boot when the filesystem cannot be mounted (fg), the dump program does not need to back up the filesystem (0), and the fsck[25] program does not need to check the filesystem at boot time (0).
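An /etc/fstab entry must have exactly six whitespace-separated fields, so the mount options have to form one comma-separated field with no spaces in it. The sketch below is a hypothetical sanity check (the function name and dictionary keys are assumptions) that splits an entry into those six fields. The rsize and wsize values are written in bytes here (32768), the form documented on the nfs man page.

```python
def parse_fstab_entry(line):
    """Split one /etc/fstab line into its six fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(
            "expected 6 whitespace-separated fields, got %d" % len(fields))
    device, mount_point, fstype, options, dump, passno = fields
    return {
        "device": device,
        "mount_point": mount_point,
        "type": fstype,
        "options": options.split(","),
        "dump": int(dump),    # 0: dump does not back up this filesystem
        "pass": int(passno),  # 0: fsck does not check it at boot time
    }


entry = parse_fstab_entry(
    "nasserver:/clusterdata /clusterdata nfs "
    "rw,hard,nointr,tcp,vers=3,rsize=32768,wsize=32768,fg 0 0")
```

A stray space after a comma in the options would split the entry into more than six fields, and the function would raise ValueError instead of returning a result.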

[25]A filesystem sanity check can be performed by the fsck program on all locally attached storage at system boot time.

Developing NFS

The NFS protocol continues to evolve and develop with strong support from the industry. One example of this is the continuing effort that was started with NFSv4[26] file delegation to remove the NFS server performance bottleneck by distributing file I/O operations to NFS clients. With NFSv4 delegation, the applications running on an NFS client can lock different byte-range portions of the file without creating any additional network traffic to the NAS server. This has led to a development effort called NFS Extensions for Parallel Storage. The goal of this effort is to build a highly available filesystem with no single point of failure by allowing NFS clients to share file state information and file data without requiring communication with a single NAS device for each filesystem operation.

For more information about the NFS Extensions for Parallel Storage, see

[26]NFSv4 (see RFC 3530) will provide a much better security model and allow for better interoperability with CIFS. NFSv4 will also provide better handling of client crashes and robust request ordering semantics as well as a facility for mandatory file locking.

Additional Starting Points for Information on Linux and NFS

In this chapter, we have only scratched the surface of a complex topic: how to configure and use a NAS server. The focus has mostly been on lock information and basic performance issues to get you started thinking about the issues involved when you convert from a monolithic server to a Linux Enterprise Cluster that will run legacy multiuser applications. Here are additional resources to assist you:

  • When building an NFS client on Linux, begin by reading the Network Appliance Technical Article "TR3183."

  • The home page for the open source version of NFS is

  • Linux NFS performance tuning is discussed at

  • Also, for Linux NFS performance tuning, see the "Linux NFS-HOWTO" at


    See especially the online transaction processing (OLTP) benchmark, which was patterned after the widely used TPC-C benchmark.

  • An excellent discussion of the methods for troubleshooting NFS problems is available in the book Linux NFS and Automounter Administration by Erez Zadok.

In Conclusion

If you are deploying a new application, your important data—the data that is modified often by the end-users—will be stored in a database such as Postgres, MySQL, or Oracle. The cluster nodes will rely on the database server (outside the cluster) to arbitrate write access to the data, and the locking issues I've been discussing in this chapter do not apply to you.[27] However, if you need to run a legacy application that was originally written for a monolithic Unix server, you can use NFS to hide (from the application programs) the fact that the data is now being shared by multiple nodes inside the cluster. The legacy multiuser applications will acquire locks and write data to stable storage the way they have always done without even knowing that they are running inside of a cluster—thus the cluster, from the inside as well as from the outside, appears to be a single, unified computing resource.

[27]To build a highly available SQL server, you can use the Heartbeat package and a shared storage device rather than a shared filesystem. The shared storage device (a shared SCSI bus, or SAN storage) is only mounted on one server at a time—the server that owns the SQL "resource"—so you don't need a shared filesystem.
