
Part 2: Developing Clients for Common Services

Perl lets you write network applications that do many things, from sending e-mail to accessing web services. Client modules can be anywhere from a few lines long to several thousand. This part covers the standard client modules and shows how to use them to solve common problems. They are built on the Berkeley socket API, which we covered in the previous part.

Chapter 6. FTP and Telnet

Two of the oldest Internet protocols are the File Transfer Protocol, FTP, and Telnet, for remote login. They illustrate the two extremes of network protocols: An FTP session is a highly structured and predictable set of transactions; a Telnet session is unpredictable and highly interactive. Perl has modules that can tame them both.

Net::FTP

There's a directory on a remote FTP server that changes every few weeks. You want to mirror a copy of the directory on your local machine and update your copy every time it changes. You can't use one of the many "mirror" scripts to do this because the directory name contains a timestamp, and you need to do a pattern match to identify the right directory. Net::FTP to the rescue.

Net::FTP is part of the libnet utilities by Graham Barr. In addition to Net::FTP, libnet includes Net::SMTP, Net::NNTP, and Net::POP3 discussed in later chapters. When you install the libnet modules, the install script prompts you for various default configuration parameters used by the Net::* modules. This includes such things as an FTP firewall proxy and the default mail exchanger for your domain. See the documentation for Net::Config (also part of the libnet utilities) for information on how to override the defaults later.

Net::FTP, like many of the client modules, uses an object-oriented interface. When you first log in to an FTP server, the module returns a Net::FTP object to you. You then use this object to get directory listings from the server, to transfer files, and to send other commands.

A Net::FTP Example

Figure 6.1 is a simple example that uses Net::FTP to connect to ftp.perl.org and download the file named RECENT from the directory /pub/CPAN/. If the program runs successfully, it creates a file named RECENT in the current directory. This file contains the names of all files recently uploaded to CPAN.

Figure 6.1. Downloading a single file with Net::FTP

Lines 1-5: Initialize. We load the Net::FTP module and define constants for the host to connect to and the file to download.

Line 6: Connect to remote host. We connect to the FTP host by invoking the Net::FTP new() method with the name of the host to connect to. If successful, new() returns a Net::FTP object connected to the remote server. Otherwise, it returns undef, and we die with an error message. In case of failure, new() leaves a diagnostic error message in $@.

Line 7: Log in to the server. After connecting to the server, we still need to log in by calling the Net::FTP object's login() method with a username and password. In this case, we are using anonymous FTP, so we provide the username "anonymous" and let Net::FTP fill in a reasonable default password. If login is successful, login() returns a true value. Otherwise, it returns false and we die, using the FTP object's message() method to retrieve the text of the server's last message.

Line 8: Change to remote directory. We invoke the FTP object's cwd() ("change working directory") method to enter the desired directory. If this call fails, we again die with the server's last message.

Line 9: Retrieve the file. We call the FTP object's get() method to retrieve the desired file. If successful, Net::FTP copies the remote file to a local one of the same name in the current directory. Otherwise we die with an error message.

Lines 10-11: Quit. We call the FTP object's quit() method to close the connection.
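
The listing for Figure 6.1 does not appear in this copy of the text. As a rough sketch only, the login, cwd, get, and quit steps described above might be written as follows. The helper name fetch_file() is ours, not the book's, and the line numbers will not match the walk-through.

```perl
use strict;
use warnings;

# Log in anonymously, change directory, fetch one file, and quit.
# $ftp is assumed to be a connected Net::FTP object created elsewhere
# with Net::FTP->new($host). Dies with the server's last message on failure.
sub fetch_file {
    my ($ftp, $dir, $file) = @_;
    $ftp->login('anonymous') or die "Can't log in: ",   $ftp->message;
    $ftp->cwd($dir)          or die "Can't cwd: ",      $ftp->message;
    $ftp->get($file)         or die "Can't get file: ", $ftp->message;
    $ftp->quit;
    return 1;
}
```

With a connected object, for example one returned by Net::FTP->new('ftp.perl.org'), the call fetch_file($ftp, '/pub/CPAN', 'RECENT') performs the download. Check the return value of new() first, since it is undef when the connection fails.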

FTP and Command-Based Protocols

FTP is an example of a common paradigm for Internet services: the command-based protocol. The interaction between client and server is constrained by a well-defined protocol in which the client issues a single-line command and the server returns a line-oriented response.

Each of the client commands is a short case-insensitive word, possibly followed by one or more arguments. The command is terminated by a CRLF pair. As we saw in Chapter 5, when we used the gab2.pl script to communicate with an FTP server, the client commands in the FTP protocol include USER and PASS, which together are used to log into the server; HELP, to get usage information; and QUIT, to quit the server. Other commands are used to send and retrieve files, obtain directory listings, and so forth. For example, when the client wishes to log in under the user name "anonymous," it will send this command to the server:

USER anonymous
 

Each response from the server to the client consists of one or more CRLF-delimited lines. The first line always begins with a three-digit numeric result code indicating the outcome of the command. This is usually followed by a human-readable message. For example, a successful USER command will result in the following server response:

331 Guest login ok, send your complete e-mail address as password.
 

Sometimes a server response will stretch over several lines. In this case, the numeric result code on the first line will end in a "-", and the result code will be repeated (without the dash) on the last line. The FTP protocol's response to the HELP command illustrates this:

HELP
 214-The following commands are recognized (* =>'s unimplemented).
    USER   PORT  STOR   MSAM*  RNTO  NLST  MKD   CDUP
    PASS   PASV  APPE   MRSQ*  ABOR  SITE  XMKD  XCUP
    ACCT*  TYPE  MLFL*  MRCP*  DELE  SYST  RMD   STOU
    SMNT*  STRU  MAIL*  ALLO   CWD   STAT  XRMD  SIZE
    REIN*  MODE  MSND*  REST   XCWD  HELP  PWD   MDTM
    QUIT   RETR  MSOM*  RNFR   LIST  NOOP  XPWD
 214 Direct comments to ftp-bugs@wuarchive.wustl.edu
 

Commonly the client and server need to exchange large amounts of non-command data. To do this, the client sends a command to warn the server that the data is coming, sends the data, and then terminates the information by sending a lone dot (".") on a line by itself. We will see an example of this in the next chapter when we examine the interaction between an e-mail client and an SMTP server.

Server result codes are arbitrary but generally follow a simple convention. Result codes between 100 and 199 are used for informational messages, while those in the 200-299 range are used to indicate successful completion of a command. Codes in the 300-399 range are used to indicate that the client must provide more information, such as the password that accompanies a username. Result codes of 400 or greater indicate various errors: the 400-499 codes are used for client errors, such as an invalid command, while 500 and greater are used for server-side errors, such as an out of memory condition.
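
To make the convention concrete, here is a small sketch that splits a response line into its parts and maps the result code onto the categories just described. Both function names are ours, for illustration only. The category labels simply restate the text and are not part of any protocol.

```perl
use strict;
use warnings;

# Split one response line into (code, is_continuation, text).
# Returns an empty list for lines that do not start with a result code.
sub parse_response_line {
    my $line = shift;
    $line =~ s/\r?\n\z//;                      # strip the CRLF terminator
    my ($code, $sep, $text) = $line =~ /^(\d{3})([- ]?)(.*)$/ or return;
    return ($code, $sep eq '-' ? 1 : 0, $text);
}

# Map a three-digit result code to the category used in the text.
sub classify_code {
    my $code = shift;
    return 'invalid' unless defined $code && $code =~ /^[1-5]\d\d$/;
    return 'informational'     if $code < 200;
    return 'success'           if $code < 300;
    return 'more input needed' if $code < 400;
    return 'client error'      if $code < 500;
    return 'server error';
}
```

Feeding it the two server lines shown earlier, the USER reply yields code 331 with no continuation flag, and the first line of the HELP reply yields code 214 with the continuation flag set.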

Because command-based servers are so common, the libnet package comes with a generic building block module called Net::Cmd. The module doesn't actually do anything by itself, but adds functionality to descendants of the IO::Socket module that allows them to easily communicate with this type of network server. Net::FTP, Net::SMTP, Net::NNTP, and Net::POP3 are all derived from Net::Cmd.

The two major methods provided by Net::Cmd objects are command() and response():

$success = $obj->command($command [,@args])

Send the command indicated by $command to the server, optionally followed by one or more arguments. command() automatically inserts spaces between arguments and appends a CRLF to the end of the command. If the command was delivered successfully, the method returns true.

$status = $obj->response

Fetches and parses the server's response to the last command, returning the most significant digit as the method result. For example, if the server's result code is 331, response() will return 3. It returns undef in case of failure.

Subclasses of Net::Cmd build more sophisticated methods on top of command() and response(). For example, the Net::FTP login() method calls command() twice: once to issue the USER command and again to issue the PASS command. You will not ordinarily call command() and response() yourself, but use the more specialized (and convenient) methods provided by the subclass. However, command() and response() are available should you need access to functionality that isn't provided by the module.

Several methods provided by Net::Cmd are commonly used by end-user applications. These are code(), message(), and ok():

$code = $obj->code

Returns the three-digit numeric result code from the last response.

$message = $obj->message

Returns the text of the last message from the server. This is particularly useful for diagnosing errors.

$ok = $obj->ok

The ok() method returns true if the last server response indicated success, false otherwise. It returns true if the result code is greater than 0 but less than 400.
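
As an illustration of how these five methods fit together, the following sketch wraps them in two helpers. The helper names are hypothetical. $obj stands for any connected Net::Cmd descendant, such as a Net::FTP object.

```perl
use strict;
use warnings;

# Send a raw command and return the most significant digit of the reply.
# Returns undef (or an empty list) if the command could not be sent.
sub raw_command {
    my ($obj, $command, @args) = @_;
    $obj->command($command, @args) or return;
    return $obj->response;
}

# Build a one-line diagnostic from the last server response.
sub last_response {
    my $obj     = shift;
    my $message = $obj->message;
    $message = '' unless defined $message;
    $message =~ s/\s+\z//;                    # server text ends with a newline
    my $status = $obj->ok ? 'ok' : 'failed';
    return sprintf '%s [%s] %s', $status, $obj->code, $message;
}
```

For example, raw_command($ftp, 'NOOP') sends a command that does nothing, and last_response($ftp) then reports how the server answered.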

The Net::FTP API

We'll now look at the Net::FTP API in greater detail. Net::FTP is a descendant of both IO::Socket and Net::Cmd. As a descendant of IO::Socket, it can be used as a filehandle to communicate directly with the server. For example, you can read and write to a Net::FTP object with syswrite() and sysread(), although you would probably not want to. As a descendant of Net::Cmd, Net::FTP supports the code(), message(), and ok() methods discussed in the previous section. The FTP protocol's status codes are listed in RFC 959 (see Appendix D).

To the generic methods inherited from its ancestors, Net::FTP adds a large number of specialized methods that support the special features of the FTP protocol. Only the common methods are listed here. See the Net::FTP documentation for the full API.

$ftp = Net::FTP->new($host [,%options])

The new() method creates a Net::FTP object. The mandatory first argument is the domain name of the FTP server you wish to contact. Additional optional arguments are a set of key/value pairs that set options for the session, as shown in Table 6.1. For example, to connect to ftp.perl.org with hash marks enabled and a timeout of 30 seconds, we could use this statement:

$ftp = Net::FTP->new('ftp.perl.org', Timeout=>30, Hash=>1);
 

Table 6.1. Net::FTP->new() Options

Option      Description
Firewall    Name of the FTP proxy to use when your machine is behind certain types of firewalls
BlockSize   Block size of transfers (default 10240)
Port        FTP port to connect to (default 21)
Timeout     Timeout value, in seconds, for various operations (default 120 seconds)
Debug       Debug level; set to greater than zero for verbose debug messages
Passive     Use FTP passive mode for all file transfers; required by some firewalls
Hash        Prints a hash mark to STDERR for each 1024 bytes of data transferred

$success = $ftp->login([$username [,$password [,$account]]])

The login() method attempts to log in to the server using the provided authentication information. If no username is provided, then Net::FTP assumes "anonymous". If no username or password is provided, then Net::FTP looks up the authentication information in the user's .netrc file. If this is still not found, it generates a password of the form "$USER@", where $USER is your login name.

The optional $account argument is for use with some FTP servers that require an additional authentication password to gain access to the filesystem after logging into the server itself. login() returns true if the login was successful, and false otherwise.

See the Net::Netrc manual pages for more information on the .netrc file.

$type = $ftp->ascii

Puts the FTP object into ASCII mode. The server automatically performs newline translation during file transfers (ending lines with CRLF on Windows machines, LF on UNIX machines, and CR on Macintoshes). This is suitable for transferring text files.

The return value is the previous value of the transfer type, such as "binary." Note: ASCII mode is the default.

$type = $ftp->binary

Puts the FTP object into binary mode. The server will not perform translation. This is suitable for transferring binary files such as images.

$success = $ftp->delete($file)

Deletes the file $file on the server, provided you have sufficient privileges to do this.

$success = $ftp->cwd([$directory])

Attempts to change the current working directory on the remote end to the specified path. If no directory is provided, it attempts to change to the root directory "/". Relative directories are understood, and you can provide a pathname of ".." to move up one level.

$directory = $ftp->pwd

Returns the full pathname of the current working directory on the remote end.

$success = $ftp->rmdir($directory)

Removes the specified directory, provided you have sufficient privileges to do so.

$success = $ftp->mkdir($directory [,$parents])

Creates a new directory at the indicated path, provided you have sufficient privileges to do so. If $parents is true, Net::FTP attempts to create all missing intermediate directories as well.

@items = $ftp->ls([$directory])

Gets a short-format directory list of all the files and subdirectories in the indicated directory or, if not specified, in the current working directory. In a scalar context, ls() returns a reference to an array rather than the list itself.

By default, each member of the returned list consists of just the bare file or directory name. However, since the FTP daemon just passes the argument to the ls command, you are free to pass command-line arguments to ls. For example, this returns a long listing:

@items = $ftp->ls('-lF');
 

@items = $ftp->dir([$directory])

Gets a long-format directory list of all the files and subdirectories in the indicated directory or, if not specified, in the current working directory. In a scalar context, dir() returns a reference to an array rather than the list itself.

In contrast to ls(), each member of the returned list is a line of a directory listing that provides the file modes, ownerships, and sizes. It is equivalent to calling the ls command with the -lg options.

$success = $ftp->get($remote [,$local [, $offset]])

The get() method retrieves the file named $remote from the FTP server. You may provide a full pathname or one relative to the current working directory.

The $local argument specifies the local pathname to store the retrieved file to. If not provided, Net::FTP creates a file with the same name as the remote file in the current directory. You may also pass a filehandle in $local, in which case the contents of the retrieved file are written to that handle. This is handy for sending files to STDOUT:

$ftp->get('RECENT',\*STDOUT)
 

The $offset argument can be used to restart an interrupted transmission. It gives a position in the file that the FTP server should seek before transmitting. Here's an idiom for using it to restart an interrupted transmission:

my $offset = (stat($file))[7] || 0;
 $ftp->get($file,$file,$offset);
 

The call to stat() fetches the current size of the local file or, if none exists, 0. This is then used as the offset to get().

$fh = $ftp->retr($filename)

Like get(), the retr() method can be used to retrieve a remote file. However, rather than writing the file to a filehandle or disk file, it returns a filehandle that can be read from to retrieve the file directly. For example, here is how to read the file named RECENT located on a remote FTP server without creating a temporary local file:

$fh = $ftp->retr('RECENT') or die "can't get file: ", $ftp->message;
 print while <$fh>;
 $fh->close;
 

$success = $ftp->put($local [,$remote])

The put() method transfers a file from the local host to the remote host. The naming rules for $local and $remote are identical to get(), including the ability to use a filehandle for $local.

$fh = $ftp->stor($filename)

$fh = $ftp->appe($filename)

These two methods initiate file uploads. The file will be stored on the remote server under the name $filename. If the remote server allows the transfer, the method returns a filehandle that can be used to transmit the file contents. The methods differ in how they handle the case of an existing file with the specified name. The stor() method overwrites the existing file, and appe() appends to it.
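
A minimal sketch of an upload through the returned handle might look like this. It assumes that the handle offers the write(BUFFER, SIZE) and close() methods of the Net::FTP data connection class, as we understand them from the Net::FTP documentation. The helper name upload_string() is ours.

```perl
use strict;
use warnings;

# Upload an in-memory string to $remote_name on the server.
# $ftp is assumed to be a logged-in Net::FTP object. A true $append
# selects appe() instead of stor().
sub upload_string {
    my ($ftp, $remote_name, $data, $append) = @_;
    my $fh = $append ? $ftp->appe($remote_name) : $ftp->stor($remote_name);
    return unless $fh;                               # server refused the transfer
    defined $fh->write($data, length $data) or return;
    return $fh->close;                               # close() completes the transfer
}
```

Closing the handle matters here: the server does not send its final response for the transfer until the data connection is closed.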

$modtime = $ftp->mdtm($file)

The mdtm() method returns the modification time of the specified file, expressed as seconds since the epoch (the same format returned by the stat() function). If the file does not exist or is not a plain file, then this method returns undef. Also be aware that some older FTP servers (such as those from Sun) do not support retrieval of modification times. For these servers mdtm() will return undef.

$size = $ftp->size($file)

Returns the size of the specified file in bytes. If the file does not exist or is not a plain file, then this method returns undef. Also be aware that older FTP servers that do not support the SIZE command also return undef.

A Directory Mirror Script

Using Net::FTP, we can write a simple FTP mirroring script. It recursively compares a local directory against a remote one and copies new or updated files to the local machine, preserving the directory structure. The program preserves file modes in the local copy (but not ownerships) and also makes an attempt to preserve symbolic links.

Figure 6.2. The ftp_mirror.pl script

The script, called ftp_mirror.pl, is listed in Figure 6.2. To mirror a file or directory from a remote server, invoke the script with a command-line argument consisting of the remote server's DNS name, a colon, and the path of the file or directory to mirror. This example mirrors the file RECENT, copying it to the local directory only if it has changed since the last time the file was mirrored:

% ftp_mirror.pl ftp.perl.org:/pub/CPAN/RECENT

The next example mirrors the entire contents of the CPAN modules directory, recursively copying the remote directory structure into the current local working directory (don't try this verbatim unless you have a fast network connection and a lot of free disk space):

% ftp_mirror.pl ftp.perl.org:/pub/CPAN/
 						

The script's command-line options include --user and --pass, to provide a username and password for non-anonymous FTP, --verbose for verbose status reports, and --hash to print out hash marks during file transfers.

Lines 1-5: Load modules. We load the Net::FTP module, as well as File::Path and Getopt::Long. File::Path provides the mkpath() routine for creating a subdirectory with all its intermediate parents. Getopt::Long provides functions for managing command-line arguments.

Lines 6-19: Process command-line arguments. We process the command-line arguments, using them to set various global variables. The FTP host and the directory or file to mirror are stored into the variables $HOST and $PATH, respectively.
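
The argument handling could be sketched roughly as follows. This version returns a hash instead of setting the globals $HOST and $PATH, and parses a list passed in by the caller so that it can be exercised without touching @ARGV. Both choices are ours rather than the original script's, as is the function name.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse the options named in the text plus one "host:/path" argument.
# Returns a flat list of key/value pairs suitable for assigning to a hash.
sub parse_command_line {
    my @args = @_;
    my %opt  = (user => undef, pass => undef, verbose => 0, hash => 0);
    GetOptionsFromArray(\@args,
        'user=s'  => \$opt{user},
        'pass=s'  => \$opt{pass},
        'verbose' => \$opt{verbose},
        'hash'    => \$opt{hash},
    ) or die "Usage: ftp_mirror.pl [--user NAME] [--pass PW] [--verbose] [--hash] host:/path\n";

    # Split the remaining argument at the first colon.
    my $target = shift @args;
    my ($host, $path) = defined $target ? $target =~ /^([^:]+):(.+)$/ : ();
    die "Argument must have the form host:/path/to/mirror\n" unless defined $host;
    return (%opt, host => $host, path => $path);
}
```

A script would call it as my %config = parse_command_line(@ARGV); and then read $config{host} and $config{path}.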

Lines 20-23: Initialize the FTP connection. We call Net::FTP->new() to connect to the desired host, and login() to log in. If no username and password were provided as command-line arguments, we attempt an anonymous login. Otherwise, we attempt to use the authentication information to log in.

After successfully logging in, we set the file transfer type to binary, which is necessary if we want to mirror exactly the remote site, and we turn on hashing if requested.

Lines 24-26: Initiate mirroring. If all has gone well, we begin the mirroring process by calling an internal subroutine do_mirror() with the requested path. When do_mirror() is done, we close the connection politely by calling the FTP object's quit() method and exit.

Lines 27-36: do_mirror() subroutine. The do_mirror() subroutine is the main entry point for mirroring a file or directory. When first called, we do not know whether the path requested by the user is a file or directory, so the first thing we do is invoke a utility subroutine to make that determination. Given a path on a remote FTP server, find_type() returns a single-character code indicating the type of object the path points to, a "-" for an ordinary file, or a "d" for a directory.

Having determined the type of the object, we split the path into the directory part (the prefix) and the last component of the path (the leaf; either the desired file or directory). We invoke the FTP object's cwd() method to change into the parent of the file or directory to mirror.

If the find_type() subroutine indicated that the path is a file, we invoke get_file() to mirror the file. Otherwise, we invoke get_dir().
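
The split into prefix and leaf can be done with a single pattern match. The function below is a sketch of that one step under our own name, split_path(), and is not taken from the script.

```perl
use strict;
use warnings;

# Split a remote path into (prefix, leaf). Trailing slashes are ignored,
# so "/pub/CPAN/" and "/pub/CPAN" give the same result.
# Returns an empty list for a path with no leaf, such as "/".
sub split_path {
    my $path = shift;
    my ($prefix, $leaf) = $path =~ m{^(.*/)?([^/]+)/*\z};
    return unless defined $leaf;
    $prefix = '' unless defined $prefix;     # bare name: nothing to cwd() into
    return ($prefix, $leaf);
}
```

For /pub/CPAN/RECENT this gives the prefix /pub/CPAN/ and the leaf RECENT. The caller would cwd() into the prefix when it is not empty and then hand the leaf to get_file() or get_dir().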

Lines 37-53: get_file() subroutine. This subroutine is responsible for fetching a file, but only if it is newer than the local copy, if any. After fetching the file, we try to change its mode to match the mode on the remote site. The mode may be provided by the caller; if not, we determine the mode from within the subroutine.

We begin by fetching the modification time and the size of the remote file using the FTP object's mdtm() and size() methods. Remember that these methods might return undef if we are talking to an older server that doesn't support these calls. If the mode hasn't been provided by the caller, we invoke the FTP object's dir() method to generate a directory listing of the requested file, and pass the result to parse_listing(), which splits the directory listing line into a three-element list consisting of the file type, name, and mode.

We now look for a file on the local machine with the same relative path and stat() it, capturing the local file's size and modification time information. We then compare the size and modification time of the remote file to the local copy. If the files are the same size, and the remote file is as old or older than the local one, then we don't need to freshen our copy. Otherwise, we invoke the FTP object's get() method to fetch the remote file. After the file transfer is successfully completed, we change the file's mode to match the remote version.
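
The freshness test itself is small enough to isolate. In this sketch the function name and argument order are ours. Fetching the file whenever the server cannot report a size or modification time is our assumption about the safest fallback, not something the text specifies.

```perl
use strict;
use warnings;

# Decide whether the remote file should be fetched.
# Remote values come from size() and mdtm() and may be undef on older servers.
# Local values come from (stat $file)[7, 9] and are undef if no local copy exists.
sub needs_update {
    my ($remote_size, $remote_mtime, $local_size, $local_mtime) = @_;
    return 1 unless defined $local_size && defined $local_mtime;    # no local copy
    return 1 unless defined $remote_size && defined $remote_mtime;  # cannot compare (assumption)
    return 0 if $remote_size == $local_size && $remote_mtime <= $local_mtime;
    return 1;
}
```

A caller would write something like needs_update($ftp->size($file), $ftp->mdtm($file), (stat $file)[7, 9]) and invoke get() only when the result is true.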

Lines 54-73: get_dir() subroutine, recursive directory mirroring. The get_dir() subroutine is more complicated than get_file() because it must call itself recursively in order to make copies of directories nested within it. Like get_file(), this subroutine is called with the path of the directory and, optionally, the directory mode.

We begin by creating a local copy of the directory in the current working directory if there isn't one already, using mkpath() to create intermediate directories if necessary. We then enter the newly created directory with the chdir() Perl built-in, and change the directory mode if requested.

We retrieve the current working directory at the remote end by calling the FTP object's pwd() method. This path gets stored into a local variable for safekeeping. We now enter the remote copy of the mirror directory using cwd().

We need to copy the contents of the mirrored directory to the local server. We invoke the FTP object's dir() method to generate a full directory listing. We parse each line of the listing into its type, pathname, and mode using the parse_listing() subroutine. Plain files are passed to get_file(), symbolic links to make_link(), and subdirectories are passed recursively to get_dir().

Having dealt with each member of the directory listing, we put things back the way they were before we entered the subroutine. We call the FTP object's cwd() routine to make the saved remote working directory current, and chdir('..') to move up a level in the local directory structure as well.

Lines 74-84: find_type() subroutine. find_type() is a not-entirely-satisfactory subroutine for guessing the type of a file or directory given only its path. We would prefer to use the FTP dir() method for this purpose, as in the preceding get_dir() call, but this is unreliable because of slight differences in the way that the directory command works on different servers when you pass it the path to a file versus the path to a directory.

Instead, we test whether the remote path is a directory by trying to cwd() into it. If cwd() fails, we assume that the path is a file. Otherwise, we assume that the path is a directory. Note that by this criterion, a symbolic link to a file is treated as a file, and a symbolic link to a directory is treated as a directory. This is the desired behavior.
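
The cwd() test can be sketched in a few lines. This version takes the FTP object as an argument rather than using a global, and restores the original remote directory afterward. Both details are our choices and may differ from the script.

```perl
use strict;
use warnings;

# Guess the type of a remote path: "d" if we can cwd() into it, "-" otherwise.
# $ftp is assumed to be a logged-in Net::FTP object.
sub find_type {
    my ($ftp, $path) = @_;
    my $start = $ftp->pwd;          # remember where we were
    my $type  = '-';                # assume a file unless cwd() succeeds
    if ($ftp->cwd($path)) {
        $type = 'd';
        $ftp->cwd($start);          # go back to the starting directory
    }
    return $type;
}
```

Note that a path that does not exist at all is also reported as "-", so the caller still has to cope with a failed get().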

Lines 85-92: make_link() subroutine. The make_link() subroutine tries to create a local symbolic link that mirrors a remote link. It works by assuming that the entry in the remote directory listing denotes the source and target of a symbolic link, like this:

README.html -> index.html
 

We split the entry into its two components and pass them to the symlink() built-in. Only symbolic links that point to relative targets are created. We don't attempt to link to absolute paths (such as "/CPAN") because this will probably not be valid on the local machine. Besides, it's a security issue.
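
A sketch of this logic, with the parsing separated from the call to symlink() so that it can be checked on its own, might read as follows. The helper name parse_link() is hypothetical.

```perl
use strict;
use warnings;

# Split a listing entry of the form "name -> target".
# Returns an empty list for malformed entries and for absolute targets.
sub parse_link {
    my $entry = shift;
    my ($name, $target) = $entry =~ /^(.+?) -> (.+)$/ or return;
    return if $target =~ m{^/};         # skip links to absolute paths
    return ($name, $target);
}

# Create the local link. symlink() takes the target first, then the new name.
sub make_link {
    my $entry = shift;
    my ($name, $target) = parse_link($entry) or return;
    return symlink($target, $name);
}
```

The argument order of symlink() is easy to get backward: the first argument is what the link points to, the second is the link being created.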

Lines 93-106: parse_listing() subroutine. The parse_listing() subroutine is invoked by get_dir() to process one line of the directory listing retrieved by Net::FTP->dir(). This subroutine is necessitated by the fact that the vanilla FTP protocol doesn't provide any other way to determine the type or mode of an element in a directory listing. The subroutine parses the directory entry using a regular expression that allows variants of common directory listings. The file's type code is derived from the first character of the symbolic mode field (e.g., the "d" in drwxr-xr-x), and its mode from the remainder of the field. The filename is whatever follows the date field.

The type, name, and mode are returned to the caller, after first converting the symbolic file mode into its numeric form.
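
The regular expression used by the script is not shown here, so the pattern below is only our guess at one that handles common UNIX-style listings. It returns the mode in symbolic form and leaves the numeric conversion to a separate step. Like the original, it loses leading whitespace in filenames.

```perl
use strict;
use warnings;

# Parse one line of a UNIX-style long listing into (type, name, symbolic mode).
# Returns an empty list for lines that do not look like directory entries.
sub parse_listing {
    my $line = shift;
    my ($type, $mode, $name) = $line =~ m{
        ^([a-z-])([a-zA-Z-]{9})     # type character and nine mode characters
        \s+ \d+                     # link count
        (?:\s+ \S+){1,2}            # owner, and group if the server lists one
        \s+ \d+                     # size in bytes
        \s+ \w{3} \s+ \d{1,2}       # month and day
        \s+ [\d:]{4,5}              # time of day, or year for older files
        \s+ (.+) $                  # whatever follows the date is the name
    }x or return;
    return ($type, $name, $mode);
}
```

For a symbolic link the returned name still contains the " -> target" part, which is the form that make_link() expects.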

Lines 107-122: filemode() subroutine. This subroutine is responsible for converting a symbolic file mode into its numeric equivalent. For example, the symbolic mode rw-r--r-- becomes octal 0644. We treat the setuid or setgid bits as if they were execute bits. It would be a security risk to create a set-id file locally.
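
One compact way to do the conversion is to walk the nine characters and shift a bit in for each. This is a sketch of the idea, not the script's code. It treats a lowercase "s" or "t" as a plain execute bit and leaves the bit clear for an uppercase "S" or "T", which mark set-id or sticky entries that have no execute permission.

```perl
use strict;
use warnings;

# Convert a nine-character symbolic mode such as "rw-r--r--" to a number.
# Returns undef (or an empty list) if the string is not a valid mode field.
sub filemode {
    my $symbolic = shift;
    return unless defined $symbolic && $symbolic =~ /^[rwxsStT-]{9}$/;
    my $mode = 0;
    for my $char (split //, $symbolic) {
        $mode <<= 1;                          # make room for the next bit
        $mode |= 1 if $char =~ /[rwxst]/;     # any set permission turns the bit on
    }
    return $mode;
}
```

The result can be passed directly to chmod(), or formatted with sprintf '%04o' for display.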

When we run the mirror script in verbose mode on CPAN, the beginning of the output looks like the following:

% ftp_mirror.pl --verbose ftp.perl.org:/pub/CPAN
 Getting directory CPAN/
 Symlinking CPAN.html -> authors/Jon_Orwant/CPAN.html
 Symlinking ENDINGS -> .cpan/ENDINGS
 Getting file MIRRORED.BY
 Getting file MIRRORING.FROM
 Getting file README
 Symlinking README.html -> index.html
 Symlinking RECENT -> indices/RECENT-print
 Getting file RECENT.html
 Getting file ROADMAP
 Getting file ROADMAP.html
 Getting file SITES
 Getting file SITES.html
 Getting directory authors/
 Getting file 00.Directory.Is.Not.Maintained.Anymore
 Getting file 00upload.howto
 Getting file 00whois.html
 Getting file 01mailrc.txt.gz
 Symlinking Aaron_Sherman -> id/ASHER
 Symlinking Abigail -> id/ABIGAIL
 Symlinking Achim_Bohnet -> id/ACH
 Symlinking Alan_Burlison -> id/ABURLISON
 ...
 

When we run it again a few minutes later, we see messages indicating that most of the files are current and don't need to be updated:

% ftp_mirror.pl --verbose ftp.perl.org:/pub/CPAN
 Getting directory CPAN/
 Symlinking CPAN.html -> authors/Jon_Orwant/CPAN.html
 Symlinking ENDINGS -> .cpan/ENDINGS
 Getting file MIRRORED.BY: not newer than local copy.
 Getting file MIRRORING.FROM: not newer than local copy.
 Getting file README: not newer than local copy.
 ...
 

The major weak point of this script is the parse_listing() routine. Because the FTP directory listing format is not standardized, server implementations vary slightly. During development, I tested this script on a variety of UNIX FTP daemons as well as on the Microsoft IIS FTP server. However, this script may well fail with other servers. In addition, the regular expression used to parse directory entries will probably fail on filenames that begin with whitespace.

Net::Telnet

FTP is the quintessential line-oriented server application. Every command issued by the client takes the form of a single, easily parsed line, and each response from the server to the client follows a predictable format. Many of the server applications that we discuss in later chapters, including POP, SMTP, and HTTP, are similarly simple. This is because the applications were designed to interact primarily with software, not with people.

Telnet is almost exactly the opposite. It was designed to interact directly with people, not software. The output from a Telnet session is completely unpredictable, depending on the remote host's configuration, the shell the user has installed, and the setup of the user's environment.

Telnet does some things that make it easy for human beings to use: It puts its output stream into a mode that echoes back all commands that are sent to it, allowing people to see what they type, and it puts its input stream into a mode that allows it to read and respond to one character at a time. This allows command-line editing and full-screen text applications to work.

While these features make it easy for humans to use Telnet-based applications, they make scripting such applications a challenge. Because the Telnet protocol is more complex than sending commands and receiving responses, you can't simply connect a socket to port 23 (Telnet's default) on a remote machine and start exchanging messages. Before the Telnet client and server can talk, they must engage in a handshake procedure to negotiate communications session parameters. Nor is it possible for a Perl script to open a pipe to the Telnet client program because the Telnet client, like many interactive programs, expects to be opened on a terminal device and tries to change the characteristics of the device using various ioctl() calls.

Given these factors, it is best not to write clients for interactive applications. Sometimes, though, it's unavoidable. You may need to automate a legacy application that is available only as an interactive terminal application. Or you may need to remotely drive a system utility that is only accessible in interactive form. A classic example of the latter is the UNIX passwd program for changing users' login passwords. Like Telnet, passwd expects to talk directly to a terminal device, and you must do special work to drive it from a Perl script.

The Net::Telnet module provides access to Telnet-based services. With its facilities, you can log into a remote host via the Telnet protocol, run commands, and act on the results using a straightforward pattern-matching idiom. When combined with the IO::Pty module, you can also use Net::Telnet to control local interactive programs.

Net::Telnet was written by Jay Rogers and is available on CPAN. It is a pure Perl module, and will run unmodified on Windows and Macintosh systems. Although it was designed to interoperate with UNIX Telnet daemons, it is known to work with the Windows NT Telnet daemon available on the Windows NT Network Resource Kit CD and several of the freeware daemons.

A Simple Net::Telnet Example

Figure 6.3 shows a simple script that uses Net::Telnet. It logs into a host, runs the command ps -ef to list all running processes, and then echoes the information to standard output.

Figure 6.3. remoteps.pl logs into a remote host and runs the "ps" command

Lines 1-3: Load modules. We load the Net::Telnet module. Because it is entirely object-oriented, there are no symbols to import.

Lines 4-6: Define constants. We hard-code constants for the host to connect to, and the user and password to log in as (no, this isn't my real password!). You'll need to change these as appropriate for your system.

Line 7: Create a new Net::Telnet object. We call Net::Telnet->new() with the name of the host. Net::Telnet attempts to connect to the host, returning a new Net::Telnet object if successful or, if a connection could not be established, undef.

Line 8: Log in to remote host. We call the Telnet object's login() method with the username and password. login() will attempt to log in to the remote system, and will return true if successful.

Lines 9-10: Run the "ps" command. We invoke the cmd() method with the command to run, in this case ps -ef. If successful, this method returns an array of lines containing the output of the command (including the newlines). We print the result to standard output.
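As a rough sketch of a script with the shape just described (not the original listing), the steps could look like the following. The helper name remote_ps() and the host/user/password command line are assumptions made for illustration; the walkthrough above uses hard-coded constants instead.

```perl
#!/usr/bin/perl
# Sketch of a remoteps.pl-style script (not the original listing).
use strict;
use warnings;

# Hypothetical helper: log in on a connected Net::Telnet object and
# return the output lines of "ps -ef" (empty list if the login fails).
sub remote_ps {
    my ($telnet, $user, $password) = @_;
    $telnet->login($user, $password) or return;
    return $telnet->cmd('ps -ef');
}

# Assumed command line: host user password
if (@ARGV == 3) {
    require Net::Telnet;    # CPAN module, loaded only when actually connecting
    my ($host, $user, $password) = @ARGV;
    my $telnet = Net::Telnet->new(Host => $host, Errmode => 'return')
        or die "Can't connect to $host\n";
    my @lines = remote_ps($telnet, $user, $password)
        or die "Login or command failed: ", $telnet->errmsg, "\n";
    print @lines;
}
```

Setting Errmode to "return" here is a design choice of the sketch: it lets the script report a failed connection or login itself rather than dying inside the module.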

When we run the remoteps.pl script, there is a brief pause while the script logs into the remote host, and then the output of the ps command appears, as follows:

% remoteps.pl
 UID    PID PPID C STIME TTY      TIME CMD
 root     1   0  0 Jun26 ?    00:00:04 init
 root     2   1  0 Jun26 ?    00:00:15 [kswapd]
 root     3   1  0 Jun26 ?    00:00:00 [kflushd]
 root     4   1  0 Jun26 ?    00:00:01 [kupdate]
 root    34   1  0 Jun26 ?    00:00:01 /sbin/cardmgr
 root   114   1 30 Jun26 ?    19:18:46 [kapmd]
 root   117   1  0 Jun26 ?    00:00:00 [khubd]
 bin    130   1  0 Jun26 ?    00:00:00 /usr/sbin/rpc.portmap
 root   134   1  0 Jun26 ?    00:00:25 /usr/sbin/syslogd
 ...
 

Net::Telnet API

To accommodate the many differences between Telnet implementations and shells among operating systems, the Net::Telnet module has a large array of options. We only consider the most frequently used of them here. See the Net::Telnet documentation for the full details.

Net::Telnet methods generally have both a named-argument form and a "shortcut" form that takes a single argument only. For example, new() can be called either this way:

my $telnet = Net::Telnet->new('phage.cshl.org');
 

or like this:

my $telnet = Net::Telnet->new(Host=>'phage.cshl.org', Timeout=>5);
 

We show both forms when appropriate.

The new() method is the constructor for Net::Telnet objects:

$telnet = Net::Telnet->new($host)

$telnet = Net::Telnet->new(Option1=>$value1,Option2=>$value2 ...)

The new() method creates a new Net::Telnet object. It may be called with a single argument containing the name of the host to connect to, or with a series of option/value pairs that provide finer control over the object. new() recognizes many options, the most common of which are shown in Table 6.2.

Table 6.2. Net::Telnet->new() Arguments
Option            Description                               Default Value
Host              Host to connect to                        "localhost"
Port              Port to connect to                        23
Timeout           Timeout for pattern matches, in seconds   10
Binmode           Suppress CRLF translation                 false
Cmd_remove_mode   Remove echoed command from input          "auto"
Errmode           Set the error mode                        "die"
Input_log         Log file to write input to                none
Fhopen            Filehandle to communicate over            none
Prompt            Command-line prompt to match              "/[\$%#>] $/"

The Host and Port options are the host and port to connect to, and Timeout is the period in seconds that Net::Telnet will wait for an expected pattern before declaring a timeout.

Binmode controls whether Net::Telnet will perform CRLF translation. By default (Binmode=>0), every newline sent from the script to the remote host is translated into a CRLF pair, just as the Telnet client does it. Likewise, every CRLF received from the remote host is translated into a newline. With Binmode set to a true value, this translation is suppressed and data is transmitted verbatim.

Cmd_remove_mode controls the removal of echoed commands. Most implementations of the Telnet server echo back all user input. As a result, text you send to the server reappears in the data read back from the remote host. If Cmd_remove_mode is set to true, the first line of all data received from the server will be stripped. A false value prevents stripping, and a value of "auto" allows Net::Telnet to decide for itself whether to strip based on the "echo" setting during the initial Telnet handshake.

Errmode determines what happens when an error occurs, typically an expected pattern not being seen before the timeout. The value of Errmode can be one of the strings "die" (the default) or "return". When set to "die", Net::Telnet dies on an error, aborting your program. A value of "return" modifies this behavior, so that instead of dying the failed method returns undef. You can then recover the specific error message using errmsg(). In addition to these two strings, Errmode accepts either a code reference or an array reference. Both of these forms are used to install custom handlers that are invoked when an error occurs. The Net::Telnet documentation provides further information.

The value for Input_log should be a filename or a filehandle. All data received from the server is echoed to this file or filehandle. Since the received data usually contains the echoed command, this is a way to capture a transcript of the Net::Telnet session and is invaluable for debugging. If the argument is a previously opened filehandle, then the log is written to that filehandle. Otherwise, the argument is treated as the name of a file to open or create.

The Fhopen argument can be used to pass a previously opened filehandle to Net::Telnet for it to use in communication. Net::Telnet will use this filehandle instead of trying to open its own connection. We use this later to coerce Net::Telnet into working across a Secure Shell link.

Prompt sets the regular expression that Net::Telnet uses to identify the shell command-line prompt. This is used by the login() and cmd() methods to determine that the command ran to completion. By default, Prompt is set to a pattern that matches the default sh, csh, ksh, and tcsh prompts.

Once a Net::Telnet object is opened, you control it with several object methods:

$result = $telnet->login($username,$password)

$result = $telnet->login(Name     => $username,
                         Password => $password,
                         [Prompt  => $prompt,]
                         [Timeout => $timeout])

The login() method attempts to log into the remote host using the provided username and password. In the named-parameter form of the method call, you may override the values of Prompt and Timeout provided to new().

If the Errmode is "die" and the login method encounters an error, the call aborts your script with an error message. Otherwise, login() returns false.

$result = $telnet->print(@values)

Print a value or list of values to the remote host. A newline is automatically added for you unless you explicitly disable this feature (see the Net::Telnet documentation for details). The method returns true if all of the data was successfully written.

It is also possible to bypass Net::Telnet's character translation routines and write directly to the remote host by using the Net::Telnet object as a filehandle:

print $telnet "ls -lF\015\012";
 

$result = $telnet->waitfor($pattern)

($before,$match) = $telnet->waitfor($pattern)

($before,$match) = $telnet->waitfor([Match   => $pattern,]
                                    [String  => $string,]
                                    [Timeout => $timeout])

The waitfor() method is the workhorse of Net::Telnet. It waits up to Timeout seconds for the specified string or pattern to appear on the data stream coming from the remote host. In a scalar context, waitfor() returns a true value if the desired pattern was seen. In a list context, the method returns a two-element list consisting of the data seen before the match and the matched string itself.

You can give waitfor() a regular expression to pattern match or a simple string, in which case Net::Telnet uses index() to scan for it in incoming data. In the method's named-argument form, use the Match argument for a pattern match, and String for a simple string match. You can specify multiple alternative patterns or strings to match simply by providing more than one Match and/or String arguments.

The strings used for Match must be correctly delimited Perl pattern match operators. For example, "/bash> $/" and "m(bash> $)" will both work, but "bash> $" won't because of the absence of pattern match delimiters.

In the single-argument form of waitfor(), the argument is a pattern match. The Timeout argument may be used to override the default timeout value.
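Because a Match pattern is passed as a string with its delimiters included, it can be handy to try a candidate prompt pattern against sample text before using it in a live session. The following is a minimal sketch of that idea in plain Perl; compile_match() is a hypothetical helper, not part of Net::Telnet, and it only understands the /.../ form.

```perl
use strict;
use warnings;

# Candidate prompt pattern, written the way Net::Telnet expects it:
# a string holding a complete match operator, delimiters included.
my $match = '/[\$%#>] $/';

# Hypothetical helper: turn a "/.../" string into a compiled regex so
# that it can be tried offline. Other delimiters are not handled.
sub compile_match {
    my ($string) = @_;
    my ($body) = $string =~ m{^/(.*)/$}s
        or die "Not a /.../ pattern: $string\n";
    return qr/$body/;
}

my $re = compile_match($match);
for my $tail ('$ ', 'bash> ', '# ', 'Password: ') {
    printf "%-12s %s\n", "'$tail'",
        $tail =~ $re ? 'looks like a prompt' : 'no match';
}
```

Note the backslash before the dollar sign inside the character class: without it, Perl may treat "$%" as a variable when the pattern string is later evaluated.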

This code fragment will issue an ls -lF command, wait for the command line prompt to appear, and print out what came before the prompt, which ought to be the output of the ls command:

$telnet->print('ls -lF');
($before,$match) = $telnet->waitfor('/[\$%#>] $/');
print $before;
 

To issue a command to the remote server and wait for a response, you can use one of several versions of cmd():

$result = $telnet->cmd($command)

@lines = $telnet->cmd($command)

@lines = $telnet->cmd(String => $command,
                      [Output => $ref,] [Prompt => $pattern,]
                      [Timeout => $timeout,] [Cmd_remove_mode => $mode])

The cmd() method is used to send a command to the remote host and return its output, if any. It is equivalent to a print() of the command, followed by a waitfor() using the default shell prompt pattern.

In a scalar context, cmd() returns true if the shell prompt was seen before the timeout expired, and false if the method timed out first. In a list context, this method returns all the lines received prior to matching the prompt.

In the named-argument form of the call, the Output argument designates either a scalar reference or an array reference to receive the lines that preceded the match. The Prompt, Timeout, and Cmd_remove_mode arguments allow you to override the corresponding settings.

Note that a true result from cmd() does not mean that the command executed successfully. It only means that the command completed in the time allotted for it.
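One way to find out whether the remote command itself succeeded is to have the shell print its exit status after the command and pick that line out of the output. The sketch below assumes a Bourne-style remote shell (where $? holds the last exit status); the EXIT_STATUS= marker and both helper names are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical marker used to recognize the status line in the output.
use constant STATUS_TAG => 'EXIT_STATUS=';

# Wrap a command so that a Bourne-style shell prints its exit status
# on a line of its own after the command's output.
sub with_status {
    my ($command) = @_;
    return "$command; echo " . STATUS_TAG . '$?';
}

# Split the lines returned by cmd() into (status, output lines).
# The status is undef if no status line was found.
sub split_status {
    my @lines = @_;
    my $tag = STATUS_TAG;
    my $status;
    if (@lines && $lines[-1] =~ /^\Q$tag\E(\d+)\s*$/) {
        $status = $1;
        pop @lines;
    }
    return ($status, @lines);
}

# Usage with a logged-in Net::Telnet object $telnet (not created here):
#   my ($status, @out) = split_status($telnet->cmd(with_status('ls -lF /tmp')));
#   warn "ls failed\n" if !defined $status || $status != 0;
```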

To receive data from the server without scanning for patterns, use get(), getline(), or getlines():

$data = $telnet->get([Timeout=>$timeout])

The get() method performs a timed read on the Telnet session, returning any data that is available. If no data is received within the allotted time, the method dies if Errmode is set to "die" or returns undef otherwise. The get() method also returns undef on end-of-file (indicating that the remote host has closed the Telnet session). You can use eof() and timed_out() to distinguish these two possibilities.

$line = $telnet->getline([Timeout=>$timeout])

The getline() method reads the next line of text from the Telnet session. Like get(), it returns undef on either a timeout or an end-of-file. You may change the module's notion of the input record separator using the input_record_separator() method, described below.

@lines = $telnet->getlines([Timeout=>$timeout])

Return all available lines of text, or an empty list on timeout or end-of-file.

Finally, several methods are useful for debugging and for tweaking the communications session:

$msg = $telnet->errmsg

This method returns the error message associated with a failed method call. For example, after a timeout on a waitfor(), errmsg() returns "pattern match timed-out."

$line = $telnet->lastline

This method returns the last line read from the object. It's useful to examine this value after the remote host has unexpectedly terminated the connection because it might contain clues to the cause of this event.

$value = $telnet->input_record_separator([$newvalue])

$value = $telnet->output_record_separator([$newvalue])

These two methods get and/or set the input and output record separators. The input record separator is used to split input into lines, and is used by the getline(), getlines(), and cmd() methods. The output record separator is printed at the end of each line output by the print() method. Both values default to \n.

$value = $telnet->prompt([$newvalue])

$value = $telnet->timeout([$newvalue])

$value = $telnet->binmode([$newvalue])

$value = $telnet->errmode([$newvalue])

These methods get and/or set the corresponding settings, and can be used to examine or change the defaults after the Telnet object is created.

$telnet->close

The close() method severs the connection to the remote host.

A Remote Password-Changing Program

As a practical example of Net::Telnet, we'll develop a remote password-changing script named change_passwd.pl. This script will contact each of the hosts named on the command line in turn and change the user's login password. This might be useful for someone who has accounts on several machines that don't share the same authentication database. The script is used like this:

% change_passwd.pl --old=mothergOOse --new=bopEEp chiron masdorf sceptre
 						

This command line requests the script to change the current user's password on the three machines chiron, masdorf, and sceptre. The script reports success or failure to change the password on each of the indicated machines.

The script uses the UNIX passwd program to do its work. In order to drive passwd, we need to anticipate its various prompts and errors. Here's a sample of a successful interaction:

% passwd
 Changing password for lstein
 Old password: xyzzy
 Enter the new password (minimum of 5, maximum of 8 characters)
 Please use a combination of upper and lower case letters and numbers.
 New password: plugn
 Re-enter new password: plugn
 Password changed.
 

At the three password: prompts I typed my current and new passwords. However, the passwd program turns off terminal echo so that the passwords don't actually display on the screen.

A number of errors may occur during execution of passwd. In order to be robust, the password-changing script must detect them. One error occurs when the original password is typed incorrectly:

% passwd
 Changing password for lstein
 Old password: xyzyy
 Incorrect password for lstein.
 The password for lstein is unchanged.
 

Another error occurs when the new password doesn't satisfy the passwd program's criteria for a secure, hard-to-guess password:

% passwd
 Changing password for lstein
 Old password: xyzzy
 Enter the new password (minimum of 5, maximum of 8 characters)
 Please use a combination of upper and lower case letters and numbers.
 New password: hi
 Bad password: too short. Try again.
 New password: aaaaaaaaaa
 Bad password: a palindrome. Try again.
 New password: 12345
 Bad password: too simple. Try again.
 

This example shows several attempts to set the password, each one rejected for a different reason. The common part of the error message is "Bad password." We don't have to worry about a third common error in running passwd, which is failing to retype the password correctly at the confirmation prompt.

The change_passwd.pl script is listed in Figure 6.4.

Figure 6.4. Remote password-changing script

Lines 1-4: Load modules. We load Net::Telnet and the Getopt::Long module for command-line option parsing.

Lines 5-12: Define constants. We create a DEBUG flag. If this is true, then we instruct the Net::Telnet module to log all its input to a file named passwd.log. This file contains password information, so be sure to delete it promptly. The USAGE constant contains the usage statement printed when the user fails to provide the correct command-line options.

Lines 13-19: Parse command line options. We call GetOptions() to parse the command-line options. If no login name is provided explicitly, we default to the current user's login name, taken from the LOGNAME environment variable. The old and new password options are mandatory.

Line 20: Invoke change_passwd() subroutine. For each of the machines named on the command line, we invoke an internal subroutine named change_passwd(), passing it the name of the machine, the user login name, and the old and new passwords.

Lines 21-41: change_passwd() subroutine. Most of the work happens in change_passwd(). We begin by opening up a new Net::Telnet object on the indicated host, and then store the object in a variable named $shell. If DEBUG is set, we turn on logging to a hard-coded file. We also set errmode() to "return" so that Net::Telnet calls will return false rather than dying on an error.

We now call login() to attempt to log in with the user's account name and password. If this fails, we return with a warning constructed from the Telnet object's errmsg() routine.

Otherwise we are at the command prompt of the user's shell. We invoke the passwd command and wait for the expected "Old password:" prompt. If the prompt appears within the timeout limit, we send the old password to the server. Otherwise, we return with an error message.

Two outcomes are possible at this point. The passwd program may accept the password and prompt us for the new password, or it may reject the password for some reason. We wait for either of the prompts to appear, and then examine the match string returned by waitfor() to determine which of the two patterns we matched. In the former case, we proceed to provide the new password. In the latter, we return with an error message.

After the new desired password is printed (line 33), there are again two possibilities: passwd may reject the proposed password because it is too simple, or it may accept it and prompt us to confirm the new password. We handle this in the same way as before.

The last step is to print the new password again, confirming the change. We do not expect any errors at this point, but we do wait for the "Password changed" confirmation before reporting success.
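The dialogue described above can be sketched as a single subroutine. This is an illustration under stated assumptions rather than the script's actual listing: drive_passwd() is a hypothetical name, the prompt strings are taken from the sample passwd sessions shown earlier, and $shell is expected to be a logged-in Net::Telnet object (or anything with compatible print() and waitfor() methods) whose error mode is "return".

```perl
use strict;
use warnings;

# Drive the passwd program on an already logged-in session.
# Returns (1, message) on success or (0, reason) on failure.
sub drive_passwd {
    my ($shell, $old, $new) = @_;

    $shell->print('passwd');
    $shell->waitfor('/Old password: /')
        or return (0, 'no "Old password:" prompt');
    $shell->print($old);

    # Either the old password is rejected or we are asked for the new one.
    my ($pre, $match) = $shell->waitfor(Match => '/Incorrect password/',
                                        Match => '/New password: /');
    return (0, 'timed out after old password') unless defined $match;
    return (0, 'old password rejected')        if $match =~ /Incorrect/;
    $shell->print($new);

    # Either the new password is rejected or we are asked to confirm it.
    ($pre, $match) = $shell->waitfor(Match => '/Bad password/',
                                     Match => '/Re-enter new password: /');
    return (0, 'timed out after new password') unless defined $match;
    return (0, 'new password rejected')        if $match =~ /Bad/;
    $shell->print($new);

    $shell->waitfor('/Password changed\./')
        or return (0, 'no confirmation seen');
    return (1, 'password changed');
}
```

Examining the matched string returned by waitfor() is what tells the two possible outcomes apart at each step.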

Because there is little standardization among passwd programs, this script is likely to work only with those variants of UNIX that use a passwd program closely derived from the BSD version. To handle other passwd variants, you will need to modify the pattern matches appropriately by including other Match patterns in the calls to waitfor().

Running change_passwd.pl on a network of Linux systems gives output like this:

% change_passwd.pl --user=george --old=m00nd0g --new=swampH0und  \
 localhost pesto prego romano
 Password changed for george on localhost.
 Password changed for george on pesto.
 Password changed for george on prego.
 Password changed for george on romano.
 

While change_passwd.pl is running, the old and new passwords are visible to anyone who runs a ps command to view the command lines of running programs. If you wish to use this script in production, you will probably want to modify it so as to accept this sensitive information from standard input. Another consideration is that the password information is passed in the clear, and therefore vulnerable to network sniffers. The SSH-enabled password-changing script in the next section overcomes this difficulty.

Using Net::Telnet for Non-Telnet Protocols

Net::Telnet can be used to automate interactions with other network servers. Often it is as simple as providing the appropriate Port argument to the new() call. The Net::Telnet manual page provides an example of this with the POP3 protocol, which we discuss in Chapter 8.

With help from the IO::Pty module, Net::Telnet can be used to automate more complicated network services or to script local interactive programs. Like the standard Telnet client, local interactive programs expect access to a terminal device (a TTY) in order to change screen characteristics, control the cursor, and so forth. The IO::Pty module creates a "pseudoterminal device" for these programs to use. The pseudoterminal is basically a bidirectional pipe. One end of the pipe is attached to the interactive program; from the program's point of view, it looks and acts like a TTY. The other end of the pipe is attached to your script, and can be used to send data to the program and read its output.

Because the use of pseudoterminals is a powerful technique that is not well documented, we will show a practical example. Many security-conscious sites have replaced Telnet and FTP with the Secure Shell (SSH), a remote login protocol that authenticates and encrypts login sessions using a combination of public key and symmetric cryptography. The change_passwd.pl script does not work with sites that have disabled Telnet in favor of SSH, and we would like to use the ssh client to establish the connection to the remote host in order to run the passwd command.

The ssh client emits a slightly different login prompt than Telnet. A typical session looks like this:

% ssh -l george prego
 george@prego's password: *******
 Last login: Mon Jul 3 08:20:28 2000 from localhost
 Linux 2.4.01.
 %
 

The ssh client takes an optional -l command-line switch to set the name of the user to log in as, and the name of the remote host (we use the short name rather than the fully qualified DNS name in this case). ssh prompts for the password on the remote host, and then attempts to log in.

To work with ssh, we have to make two changes to change_passwd.pl: (1) we open a pseudoterminal on the ssh client and pass the controlling filehandle to Net::Telnet->new() as the Fhopen argument and (2) we replace the call to login() with our own pattern matching routine so as to handle ssh's login prompt.

The IO::Pty module, available on CPAN, has a simple API:

$pty = IO::Pty->new

The new() method takes no arguments and returns a new IO::Pty pseudoterminal object. The returned object is a filehandle corresponding to the controlling end of the pipe. Your script will ordinarily use this filehandle to send commands and read results from the program you're driving.

$tty = $pty->slave

Given a pseudoterminal created with a call to IO::Pty->new(), the slave() method returns the TTY half of the pipe. You will ordinarily pass this filehandle to the program you want to control.

Figure 6.5 shows the idiom for launching a program under the control of a pseudoterminal. The do_cmd() subroutine accepts the name of a local command to run and a list of arguments to pass it. We begin by creating a pseudoterminal filehandle with IO::Pty->new() (line 3). If successful, we fork(), and the parent process returns the pseudoterminal to the caller. The child process, however, has a little more work to do. We first detach from the current controlling TTY by calling POSIX::setsid() (see Chapter 10 for details). The next step is to recover the TTY half of the pipe by calling the IO::Pty object's slave() method, and then close the pseudoterminal half (lines 7-8).

Figure 6.5. Launching a program in a pseudo-tty

We now reopen STDIN, STDOUT, and STDERR on the new TTY object using fdopen(), and close the now-unneeded copy of the filehandle (lines 9-12). We make STDOUT unbuffered and invoke exec() to run the desired command and arguments. When the command runs, its standard input and output will be attached to the new TTY, which in turn will be attached to the pseudo-tty controlled by the parent process.
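A do_cmd() routine following the steps just described might be sketched like this. It is a reconstruction from the description, so its line numbers will not match those cited for the figure, and loading IO::Pty with require inside the subroutine is a choice made for this sketch.

```perl
use strict;
use warnings;
use IO::Handle;             # provides the fdopen() method on STDIN/STDOUT/STDERR
use POSIX qw(setsid);

# Launch $cmd under a pseudoterminal and return the controlling filehandle.
sub do_cmd {
    my ($cmd, @args) = @_;
    require IO::Pty;        # CPAN module, loaded on first use
    my $pty = IO::Pty->new or die "Can't make pty: $!";
    defined(my $child = fork) or die "Can't fork: $!";
    return $pty if $child;  # parent: hand back the controlling end

    # Child: detach from the old terminal and attach to the new one.
    setsid();
    my $tty = $pty->slave;  # fetch the TTY half before closing the pty
    close $pty;
    STDIN->fdopen($tty, '<')  or die "STDIN: $!";
    STDOUT->fdopen($tty, '>') or die "STDOUT: $!";
    STDERR->fdopen($tty, '>') or die "STDERR: $!";
    close $tty;
    $| = 1;                 # unbuffer STDOUT
    exec $cmd, @args;
    die "Couldn't exec $cmd: $!";
}
```

A caller would then read from and write to the returned filehandle, for example `my $pty = do_cmd('ssh', '-l', $user, $host);`, where $user and $host are supplied by the surrounding script.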

With do_cmd() written, the other changes to change_passwd.pl are relatively minor. Figure 6.6 shows the revised script written to use the ssh client, change_passwd_ssh.pl.

Figure 6.6. Changing passwords over a Secure Shell connection

Lines 1-6: Load modules. We load IO::Pty and the setsid() routine from the POSIX module.

Lines 7-23: Process command-line arguments and call change_passwd(). The only change here is a new constant, PROMPT, that contains the pattern match that we will expect from the user's shell command prompt.

Lines 24-27: Launch ssh subprocess. We invoke do_cmd() to run the ssh program using the requested username and host. If do_cmd() is successful, it returns a filehandle connected to the pseudoterminal driving the ssh subprocess.

Lines 28-31: Create and initialize Net::Telnet object. In the change_passwd() routine, we create a new Net::Telnet object, but now instead of allowing Net::Telnet to open a connection to the remote host directly, we pass it the ssh filehandle using the Fhopen argument. After creating the Net::Telnet object, we configure it by putting it into binary mode with binmode(), setting the input log for debugging, and setting the error mode to "return". The use of binary mode is a small but important modification of the original script. Since the SSH protocol terminates its lines with a single LF character rather than CRLF pairs, the default Net::Telnet CRLF translation is inappropriate.

Lines 32-34: Log in. Instead of calling Net::Telnet's built-in login() method, which expects Telnet-specific prompts, we roll our own by waiting for the ssh "password:" prompt and then providing the appropriate response. We then wait for the user's command prompt. If, for some reason, this fails, we return with an error message.

Lines 35-49: Change password. The remainder of the change_passwd() subroutine is identical to the earlier version.

Lines 50-65: do_cmd() subroutine. This is the same subroutine that we examined earlier.
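The hand-rolled login step can be sketched on its own. The helper name ssh_login() is hypothetical, and the setup shown in the comments assumes a do_cmd() routine like the one described above plus a prompt pattern suited to the remote shell.

```perl
use strict;
use warnings;

# Hypothetical helper: answer ssh's password prompt, then wait for the
# user's shell prompt. $shell is a Net::Telnet-style object whose error
# mode is "return"; $prompt is a pattern string such as '/[\$%#>] $/'.
sub ssh_login {
    my ($shell, $password, $prompt) = @_;
    $shell->waitfor('/password: /') or return 0;   # ssh's own prompt
    $shell->print($password);
    $shell->waitfor($prompt)        or return 0;   # shell is ready
    return 1;
}

# Typical setup (not executed here):
#   my $pty   = do_cmd('ssh', "-l$user", $host);
#   my $shell = Net::Telnet->new(Fhopen => $pty);
#   $shell->binmode(1);            # ssh lines end in LF, not CRLF
#   $shell->errmode('return');
#   ssh_login($shell, $old_password, '/[\$%#>] $/')
#       or warn "login failed: ", $shell->errmsg, "\n";
```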

The change_passwd_ssh.pl program now uses the Secure Shell to establish connections to the indicated machines and change the user's password. This is a big advantage over the earlier version, which was prone to network eavesdroppers who could intercept the new password as it passed over the wire in unencrypted form. On multiuser systems you will still probably want to modify the script to read the passwords from standard input rather than from the command line.

For completeness, Figure 6.7 lists a routine, prompt_for_passwd(), that uses the UNIX stty program to disable command-line echo temporarily while the user is typing the password. You can use it like this:

$old = prompt_for_passwd('old password');
$new = prompt_for_passwd('new password');
 
Figure 6.7. Disabling echo while prompting for a password
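A routine of this kind might be sketched as follows. This is not the figure's listing: the check for a terminal with -t and the handling of end-of-file are additions made for the sketch.

```perl
use strict;
use warnings;

# Prompt for a password on standard input with terminal echo disabled.
sub prompt_for_passwd {
    my ($prompt) = @_;
    my $is_tty = -t STDIN;               # only touch the terminal if there is one
    local $| = 1;                        # make sure the prompt appears at once
    print "$prompt: ";
    system('stty', '-echo') if $is_tty;  # turn echo off
    my $password = <STDIN>;
    system('stty', 'echo')  if $is_tty;  # turn echo back on
    print "\n";                          # the user's newline was not echoed
    chomp $password if defined $password;
    return $password;                    # undef on end-of-file
}
```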

A slightly more sophisticated version of this subroutine, which takes advantage of the Term::ReadKey module, if available, appears in Chapter 20.

The Expect Module

An alternative to Net::Telnet is the Expect module, which provides similar services for talking to local and remote processes that expect human interaction. Expect implements a rich command language, which among other things can pause the script and prompt the user for information, such as passwords. Expect can be found on CPAN.

Chapter 7. SMTP: Sending Mail

E-mail is one of the oldest Internet applications, and it should come as no surprise that many client-side modules have been written to enable Perl to interoperate with the mail system. Various modules allow you to send and receive mail, manipulate various mailbox formats, and work with MIME attachments.

Introduction to the Mail Modules

If you examine the "Internet Mail and Utilities" section of CPAN, you'll find a bewildering array of similarly named modules. This is a quick guide to the major components.

Net::SMTP This allows you to interact directly with mail transport daemons in order to send Internet mail via the Simple Mail Transport Protocol (SMTP). The module also provides access to some of the other functions of these daemons, such as expanding e-mail aliases.

MailTools This is a higher-level way to create outgoing e-mail. It uses a variety of local mailer packages to do the grunt work.

MIME-Tools This is a package of modules for creating, decoding, and manipulating Multipurpose Internet Mail Extensions (MIME), commonly known as attachments.

Net::POP3 This is a client for the Post Office Protocol version 3 (POP3). It provides a way to retrieve a user's stored mail messages from a central maildrop.

Net::IMAP This is a client module for the Internet Message Access Protocol (IMAP), a sophisticated protocol for storing and synchronizing e-mail messages between mail drops and clients.

This chapter discusses tools involved in creating outgoing mail, including Net::SMTP and MIME-Tools. Chapter 8 covers the Net::POP3 and Net::IMAP modules, both of which are involved in processing incoming mail.

Net::SMTP

Net::SMTP operates at the lowest level of the e-mail access modules. It interacts directly with the SMTP daemons to transmit e-mail across the Internet. To use it effectively, you must know a bit about the innards of SMTP. The payoff for this added complexity is that Net::SMTP is completely portable, and works as well from Macintoshes and Windows machines as from UNIX systems.

The SMTP Protocol

When a client e-mail program wants to send mail, it opens a network connection to a mail server somewhere using the standard SMTP port, number 25. The client conducts a brief conversation with the server, during which time it establishes its identity, announces that it wishes to send mail to a certain party, and transmits the e-mail message. The server then takes care of seeing that the message gets where it needs to go, whether by delivering it to a local user or by transmitting the message to another server somewhere else in the world.

The language spoken by SMTP servers is a simple human-readable line-oriented protocol. Figure 7.1 shows the interaction needed to send a complete e-mail manually using Telnet as the client (the client's input is in bold).

Figure 7.1. A chat with an SMTP daemon

After connecting to the SMTP port, the server sends us a code "220" message containing a banner and greeting. We issue a HELO command, identifying the hostname of the client machine, and the server responds with a "250" message, which essentially means "OK."

After this handshake, we are ready to send some mail. We issue a MAIL command with the argument FROM:<sender's address> to designate the sender. If the sender is OK, the server responds with another "250" reply. We now issue a RCPT ("recipient") command with the argument TO:<recipient's address> to indicate the recipient. The server again acknowledges the command. Some SMTP servers have restrictions on the senders and recipients they will service; for example, they may refuse to relay e-mail to remote domains. In this case, they respond with a variety of error codes in the 500 to 599 range. It is possible to issue multiple RCPT commands for e-mail that has several recipients at the site(s) served by the SMTP server.

Having established that the sender and recipient(s) are OK, we send the DATA command. The server responds with a message prompting us for the e-mail message. The server will accept lines of input until it sees a line containing just a ".".

Internet mail has a standard format consisting of a set of header lines, a blank line, and the body of the message. Even though we have already specified the sender and recipient, we must do so again in order to create a valid e-mail message. A minimal mail header has a From: field, indicating the sender, a To: field, indicating the recipient, and a Subject: field. Other standard fields, such as the date, are filled in automatically by the mail daemon.

We add a blank line to separate the header from the body, enter the e-mail message text, and terminate the message with a dot. The server's code 250 acknowledgment indicates that the message was queued successfully for delivery.
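The text sent after the DATA command can be assembled with a few lines of Perl. The sketch below builds a minimal message in the format just described; data_payload() and its argument names are hypothetical. It also doubles any leading dot in a line, which is the standard SMTP way of keeping a line that starts with "." from being mistaken for the terminator.

```perl
use strict;
use warnings;

# Build the text to send after DATA: minimal header, blank line, body,
# and the terminating "." line, all with CRLF line endings.
sub data_payload {
    my (%arg) = @_;
    my @lines = (
        "From: $arg{from}",
        "To: $arg{to}",
        "Subject: $arg{subject}",
        '',                          # blank line separates header from body
        split(/\n/, $arg{body}),
    );
    s/^\./../ for @lines;            # protect lines that begin with a dot
    return join("\015\012", @lines, '.') . "\015\012";
}

# Example call (addresses are placeholders):
#   print data_payload(from    => 'sender@example.org',
#                      to      => 'recipient@example.org',
#                      subject => 'Test message',
#                      body    => "Hello there.\n");
```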

We could now send additional messages by issuing further MAIL commands, but instead we disconnect politely by issuing the QUIT command. The full specification of the SMTP protocol can be found in RFC 821. The standard format for Internet mail headers is described in RFC 822.

The Net::SMTP API

Net::SMTP mirrors the SMTP protocol very closely. Net::SMTP is part of the libnet utilities and is available on CPAN. Like the other Net::* modules, it uses an object-oriented interface in which you establish a connection with a particular mailer daemon, yielding a Net::SMTP object. You then call the SMTP object's methods to send commands to the server. Like Net::FTP (but unlike Net::Telnet), Net::SMTP inherits from Net::Cmd and IO::Socket::INET, allowing you to use the Net::Cmd message() and code() methods to retrieve the most recent message and numeric status code from the server. All the low-level IO::Socket and IO::Socket::INET methods are also inherited.

To create a new Net::SMTP object, use the new() constructor:

Table 7.1. Net::SMTP->new() Arguments
Option    Description                                       Default
Hello     The domain name to use in the HELO command.       Name of local host
Timeout   Seconds to wait for response from server.         120
Debug     Turn on verbose debugging information.            undef
Port      Numeric or symbolic name of port to connect to.   25

$smtp = Net::SMTP->new([$host] [,$opt1=>$val1, $opt2=>$val2...])

The new() method establishes a connection to an SMTP server and returns a new Net::SMTP object. The first optional argument is the name of the host to contact, and will default to the mail exchanger configured into Net::Config when libnet was first installed. The options are a series of named arguments. In addition to the options recognized by the IO::Socket::INET superclass, the arguments shown in Table 7.1 are possible.

If the connection is refused (or times out), new() returns false. Here's an example of contacting the mail server for the cshl.org domain with a timeout of 60 seconds.

$smtp = Net::SMTP->new('mail.cshl.org',Timeout=>60);
 

Once the object is created, you can send information to the server, or retrieve information from it, by calling object methods. Some are quite simple:

$banner = $smtp->banner()

$domain = $smtp->domain()

Immediately after connecting to an SMTP server, you can retrieve the banner and/or the domain name with which the server identified itself by calling these two methods.

To send mail, you will first call the mail() and recipient() methods to set up the exchange:

$success = $smtp->mail($address [,\%options])

The mail() method issues a MAIL command to the server. The required first argument is the address of the sender. The optional second argument is a hash reference containing various options to be passed to servers that support the Extended Simple Mail Transfer Protocol, or ESMTP. These are rarely needed; see the Net::SMTP documentation for details.

The address may be in any of the forms accepted by e-mail clients, including doe@acme.org, <doe@acme.org>, John Doe <doe@acme.org>, and doe@acme.org (John Doe).

If successful, this method returns a true value. Otherwise, it returns undef, and the inherited message() method can be used to return the text of the error message.

$success = $smtp->recipient($address1,$address2,$address3,...)

@ok_addr = $smtp->recipient($addr1,$addr2,$addr3,...,{SkipBad=>1})

The recipient() method issues an RCPT command to the server. The arguments are a list of valid e-mail addresses to which the mail is to be delivered. The list of addresses may be followed by a hash reference containing various options.

The addresses passed to recipient() must all be acceptable to the server, or the entire call will return false. To modify this behavior, pass the option SkipBad in the options hash. The module now ignores addresses rejected by the server, and returns the list of accepted addresses as its result. For example:

@ok=$smtp->recipient('lstein@cshl.org','nobody@cshl.org',{SkipBad=>1})
 

Provided that the server has accepted the sender and recipient, you may now commence sending the message text using the data(), datasend(), and dataend() methods.

$success = $smtp->data([$text])

The data() method issues a DATA command to the server. If called with a scalar argument, it transmits the value of the argument as the content (header and body) of the e-mail message. If you wish to send the message one chunk at a time, call data() without an argument and then make a series of calls to the datasend() method. The data() method returns a value indicating success or failure of the command.

$success = $smtp->datasend(@data)

After calling data() without an argument, you may call datasend() one or more times to send lines of e-mail text to the server. Lines starting with a dot are automatically escaped so as not to terminate the transmission prematurely.

You may call datasend() with an array reference, if you prefer. This method and dataend() are both inherited from the Net::Cmd base class.
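The escaping rule mentioned above is simple enough to illustrate in a few lines. The following is only a sketch of the idea, not the code that Net::Cmd actually uses; the subroutine name dot_stuff() is made up for this illustration.

```perl
use strict;
use warnings;

# Illustration only: double any dot that begins a line, so that a line
# consisting of a single "." inside the text cannot be mistaken for the
# end-of-message marker. (dot_stuff is a hypothetical name.)
sub dot_stuff {
    my $text = shift;
    $text =~ s/^\./../mg;    # /m lets ^ match at the start of every line
    return $text;
}

print dot_stuff("First line\n.hidden file\n.\nLast line\n");
```

The receiving server removes the extra dot again, so the recipient sees the text as it was written.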

$success = $smtp->dataend

When your e-mail message is sent, you should call dataend() to transmit the terminal dot. If the message was accepted for delivery, the return value is true.

A few more methods are useful for more complex interactions with SMTP servers:

$smtp->reset

This sends an RSET command to the server, aborting mail transmission operations in progress. You might call this if one of the desired recipients is rejected by the server; it resets the server so you can try again.

$valid = $smtp->verify($address)

@recipients = $smtp->expand($address)

The expand() and verify() methods can be used to check that a recipient address is valid prior to trying to send mail. verify() returns true if the specified address is accepted.

expand() does something more interesting. If the address is valid, it expands it into one or more aliases, if any exist. This can be used to identify forwarding addresses and mailing list recipients. The method returns a list of aliases or, if the specified address is invalid, an empty list. For security reasons, many mail administrators disable this feature, in which case, the method returns an empty list.

Finally, when you are done with the server, you will call the quit() method:

$smtp->quit

This method politely breaks the connection with the server.
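To tie the methods together, here is a minimal sketch of a complete session that sends the message in several chunks. The subroutine name send_note() and the addresses are placeholders chosen for illustration, and the script contacts a server only if a host name is given on the command line.

```perl
use strict;
use warnings;
use Net::SMTP;

# Send a short note in several chunks. send_note() is a hypothetical
# helper; host and addresses are supplied by the caller.
sub send_note {
    my ($host, $from, $to) = @_;
    my $smtp = Net::SMTP->new($host, Timeout => 60)
        or return;                       # connection refused or timed out

    my $ok = $smtp->mail($from)
          && $smtp->recipient($to)
          && $smtp->data                 # no argument: we send the text ourselves
          && $smtp->datasend("From: $from\n", "To: $to\n",
                             "Subject: test\n", "\n")
          && $smtp->datasend("Sent in more than one chunk.\n")
          && $smtp->dataend;             # transmits the terminating dot

    # code() and message() are inherited from Net::Cmd
    warn 'server said: ', $smtp->code, ' ', $smtp->message unless $ok;
    $smtp->quit;
    return $ok;
}

# Nothing is sent unless a mail host is named on the command line.
if (@ARGV) {
    send_note($ARGV[0], 'doe@acme.org', 'lstein@lsjs.org')
        or die "message not sent\n";
}
```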

Using Net::SMTP

With Net::SMTP we can write a one-shot subroutine for sending e-mail. The mail() subroutine takes two arguments: the text of an e-mail message to send (required), and the name of the SMTP host to use (optional). Call it like this:

$msg = <<'END';
From: John Doe <doe@acme.org>
To: L Stein <lstein@lsjs.org>
Cc: jac@acme.org, vvd@acme.org
Subject: hello there

This is just a simple e-mail message.
Nothing to get excited about.

Regards, JD
END

mail($msg,'presto.lsjs.org') or die "arggggh!";
 

We create the text of the e-mail message using the here-is (<<) syntax and store it in the variable $msg. The message must contain an e-mail header with (at a minimum) the From: and To: fields. We pass the message to the mail() subroutine, which extracts the sender and recipient fields and invokes Net::SMTP to do the dirty work. Figure 7.2 shows how mail() works.

Figure 7.2. A simple subroutine for sending e-mail

Lines 1-9: Parse the mail message. We split the message into the header and the body by splitting on the first blank line. Header fields frequently contain continuation lines that begin with a blank, so we fold those into a single line.

We parse the header into a hash using a simple pattern match, and store the From: and To: fields in local variables. The To: field can contain multiple recipients, so we isolate the individual addressees by splitting on the comma character (this will fail in the unlikely case that any of the addresses contain commas). We do likewise if the header contained a Cc: field.

Lines 10-16: Send messages. We create a new Net::SMTP object and call its mail() and recipient() methods to initiate the message. The call to recipient() uses the SkipBad option so that the method will try to deliver the mail even if the server rejects some of the recipients. We compare the number of recipients the server accepted to the number we attempted, returning from the subroutine if none were accepted, or just printing a warning if only some were rejected.

We call data() to send the complete e-mail message to the server, and quit() to terminate the connection.
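Since the listing itself is not reproduced here, the following sketch follows the steps just described. It is a reconstruction under those stated steps, not the original figure, so its line numbers will not match the walkthrough; the helper name parse_message() is introduced purely for illustration.

```perl
use strict;
use warnings;
use Net::SMTP;

# Hypothetical helper: return the sender followed by all recipients,
# or an empty list if the message has no usable header.
sub parse_message {
    my $msg = shift;
    my ($header, $body) = split /\n\n/, $msg, 2;   # first blank line
    return unless defined $header && defined $body;
    $header =~ s/\n\s+/ /g;                        # fold continuation lines
    my %fields = $header =~ /^(\S+):[ \t]*(.+)$/mg;
    my $from = $fields{From} or return;
    my @to = split /\s*,\s*/, $fields{To} // '';   # fails if an address contains a comma
    push @to, split /\s*,\s*/, $fields{Cc} if $fields{Cc};
    return ($from, @to);
}

sub mail {
    my ($msg, $server) = @_;
    my ($from, @to) = parse_message($msg) or return;

    my $smtp = Net::SMTP->new($server) or return;
    $smtp->mail($from) or return;
    my @ok = $smtp->recipient(@to, { SkipBad => 1 }) or return;
    warn "some recipients were rejected\n" if @ok < @to;

    $smtp->data($msg) or return;                   # header and body in one call
    $smtp->quit;
    return 1;
}
```

Keeping the parsing in its own subroutine makes it possible to check the header handling without contacting a mail server.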

Although this subroutine does its job, it lacks some features. For example, it doesn't handle the Bcc: field, which causes mail to be delivered to a recipient without that recipient appearing in the header. The MailTools module, described next, corrects the deficiencies.

MailTools

The MailTools module, also written by Graham Barr, is a high-level object-oriented interface to the Internet e-mail system. MailTools, available on CPAN, provides a flexible way to create and manipulate RFC 822-compliant e-mail messages. Once the message is composed, you can send it off using SMTP or use one of several UNIX command-line mailer programs to do the dirty work. This might be necessary on a local network that does not have direct access to an SMTP server.

Using MailTools

A quick example of sending an e-mail from within a script will give you the flavor of the MailTools interface (Figure 7.3).

Figure 7.3. Sending e-mail with Mail::Internet


Lines 1-2: Load modules. We bring in the Mail::Internet module. It brings in other modules that it needs, including Mail::Header, which knows how to format RFC 822 headers, and Mail::Mailer, which knows how to send mail by a variety of methods.

Lines 3-8: Create header. We call Mail::Header->new to create a new header object, which we will use to build the RFC 822 header. After creating the object, we call its add() method several times to add the From:, To:, Cc:, and Subject: lines. Notice that we can add the same header multiple times, as we do with the Cc: line. Mail::Header will also insert other required RFC 822 headers on its own.

Lines 9-13: Create body. We create the body text, which is just a block of text.

Lines 14-16: Create the Mail::Internet object. We now create a new Mail::Internet object by calling the package's new() method. The named arguments include Header, to which we pass the header object that we just created, and Body, which receives the body text. The Body argument expects an array reference containing discrete lines of body text, so we wrap $body into an anonymous array reference. Modify, the third argument to new(), tells Mail::Internet that it is OK to reformat the header lines to meet restrictions on line length that some SMTP mailers impose.

Line 17: Send mail We call the newly created Mail::Internet object's send() method with an argument indicating the sending method to use. The "sendmail" argument indicates that Mail::Internet should try to use the UNIX sendmail program to deliver the mail.

Although at first glance Mail::Internet does not hold much advantage over the Net::SMTP-based mail() subroutine we wrote in the previous section, the ability to examine and manipulate Mail::Header objects gives MailTools its power. Mail::Header is also the base class for MIME::Head, which manipulates MIME-compliant e-mail headers that are too complex to be handled manually.
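The steps described for Figure 7.3 can be sketched roughly as follows. This is a reconstruction from the walkthrough rather than the original listing, so line numbers will differ. It assumes MailTools is installed, and it prints the message instead of mailing it unless the environment variable SEND_FOR_REAL (a name invented for this sketch) is set.

```perl
use strict;
use warnings;
use Mail::Internet;    # also loads Mail::Header

# Build the header one field at a time; Cc: is added twice.
my $head = Mail::Header->new;
$head->add(From    => 'John Doe <doe@acme.org>');
$head->add(To      => 'L Stein <lstein@lsjs.org>');
$head->add(Cc      => 'jac@acme.org');
$head->add(Cc      => 'vvd@acme.org');
$head->add(Subject => 'hello there');

my $body = <<'END';
This is just a simple e-mail message.
Nothing to get excited about.

Regards, JD
END

# Body expects an array reference, so wrap the text in one.
my $mail = Mail::Internet->new(Header => $head,
                               Body   => [$body],
                               Modify => 1);

if ($ENV{SEND_FOR_REAL}) {           # hypothetical safety switch
    $mail->send('sendmail') or die "can't send message\n";
} else {
    print $mail->as_string;          # inspect the result without mailing it
}
```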

Mail::Header

E-mail headers are more complex than they might seem at first. Some fields occur just once, others occur multiple times, and some allow multiple values to be strung together by commas or another delimiter. A field may occupy a single line, or may be folded across multiple lines with leading whitespace to indicate the presence of continuation lines. The mail system also places an arbitrary limit on the length of a header line. Because of these considerations, you should be cautious of constructing e-mail headers by hand for anything much more complicated than the simple examples shown earlier.

The Mail::Header module simplifies the task of constructing, examining, and modifying RFC 822 headers. Once constructed, a Mail::Header object can be passed to Mail::Internet for sending.

Mail::Header controls the syntax but not the content of the header, which means that you can construct a header with fields that are not recognized by the mail subsystem. Depending on the mailer, a message with invalid headers might make it through to its destination, or it might get bounced. To avoid this, be careful to limit headers to the fields listed in the Internet mail and MIME RFCs (RFC 822 and RFC 2045, respectively). Table 7.2 gives some of the common headers in e-mail messages.

Fields that begin with X- are meant to be used as extensions. You can safely build a header containing any number of X- fields, and the fields will be passed through unmodified by the mail system. For example:

$header = Mail::Header->new(Modify=>1);
 $header->add('X-Mailer' => "Fido's mailer v1.0");
 $header->add('X-HiMom' => 'Hi mom!');
 

Mail::Header supports a large number of methods. The following list gives the key methods. To create a new object, call the Mail::Header new() method.

Table 7.2. Mail::Header Fields
Bcc Date Received Sender
Cc From References Subject
Comments Keywords Reply-To To
Content-Type Message-ID Resent-From X-*
Content-Transfer-Encoding MIME-Version Resent-To  
Content-Disposition Organization Return-Path  

$head = Mail::Header->new([$arg] [,@options])

The new() method is the constructor for the Mail::Header class. Called with no arguments, it creates a new Mail::Header object containing an empty set of headers.

The first argument, if provided, is used to initialize the object. Two types of arguments are accepted. You may provide an open filehandle, in which case the headers are read from the indicated file, or you may provide an array reference, in which case the headers are read from the array. In either case, each line must be a correctly formatted e-mail header, such as "Subject: this is a subject."

@options, if provided, is a list of named arguments that control various header options. The one used most frequently is Modify, which if set true allows Mail::Header to reformat header lines to make them fully RFC 822-compliant. For example:

open HEADERS,"./mail.msg" or die "can't open mail.msg: $!";
$head = Mail::Header->new(\*HEADERS, Modify=>1);
 

Once a Mail::Header object is created, you may manipulate its contents in several ways:

$head->read(FILEHANDLE)

As an alternative way to populate a header object, you can create an empty object by calling new() with no arguments, and then read in the headers from a filehandle using read().

$head->add($name,$value [,$index])

$head->replace($name,$value [,$index])

$head->delete($name [,$index])

The add(), replace(), and delete() methods allow you to modify the Mail::Header object. Each takes the name of the field to operate on and, optionally, an index that selects a member of a multivalued field; add() and replace() also take the value for the field.

The add() method appends a field to the header. If $index is provided, it inserts the field into the indicated position; otherwise, it appends the field to the end of the list.

The replace() method replaces the named field with the indicated value. If the field is multivalued, then $index is used to select which value to replace; otherwise, the first field is replaced.

Delete() removes the indicated field.

All three of these methods accept a shortcut form that allows you to specify the field name and value in a single line. This shortcut allows you to replace the Subject line like this:

$head->replace('Subject: returned to sender')
 

rather than like this:

$head->replace(Subject => 'returned to sender')
 

To retrieve information about a header object, you use get() to get the value of a single field, or tags() and count() to get information about all the available fields.

$line = $head->get($name [,$index])

@lines = $head->get($name)

The get() method retrieves the named field. In a scalar context, it returns the text form of the first indicated field; in a list context it returns all such fields. You may provide an index in order to select a single member of a multivalued field.

A slightly annoying feature of get() is that the retrieved field values contain the terminating newlines. These must be removed manually with chomp().

@fields = $head->tags

Returns the list of field names (which the Mail::Header documentation calls "tags").

$count = $head->count($tag)

Returns the number of times the given tag appears in the header.
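A short sketch may help to show how these methods fit together. It assumes MailTools is installed; the header lines are sample data taken from the earlier examples, and the X-Mailer value is made up.

```perl
use strict;
use warnings;
use Mail::Header;

# Sample header lines, each terminated by a newline.
my @lines = map { "$_\n" } (
    'From: John Doe <doe@acme.org>',
    'To: L Stein <lstein@lsjs.org>',
    'Cc: jac@acme.org',
    'Cc: vvd@acme.org',
    'Subject: hello there',
);
my $head = Mail::Header->new(\@lines, Modify => 1);

my $subject = $head->get('Subject');   # scalar context: first Subject field
chomp $subject;                        # get() leaves the newline attached

$head->replace(Subject => "Re: $subject");
$head->add('X-Mailer' => 'sketch v0.1');   # made-up extension field

# List every field name together with the number of times it occurs.
printf "%-10s %d\n", $_, $head->count($_) for $head->tags;
```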

Finally, three methods are useful for exporting the header in various forms:

$string = $head->as_string

Returns the entire header as a string in the form that will appear in the message.

$hashref = $head->header_hashref([\%headers])

The header_hashref() method returns the headers as a hash reference. Each key is the unique name of a field, and each value is an array reference containing the header's contents. This form is suitable for passing to Mail::Mailer->open(), as described later in this chapter.

You may also use this method to set the header by passing it a hash reference of your own devising. The composition of \%headers is similar to header_hashref()'s result, but the hash values can be simple scalars if they are not multivalued.

$head->print([FILEHANDLE])

Prints header to indicated filehandle or, if not specified, to STDOUT. Equivalent to:

print FILEHANDLE $head->as_string
 

Mail::Internet

The Mail::Internet class is a high-level interface to e-mail. It allows you to create messages, manipulate them in various ways, and send them out. It was designed to make it easy to write autoresponders and other mail-processing utilities.

As usual, you create a new object using the Mail::Internet new() method:

$mail = Mail::Internet->new([$arg] [,@options])

The new() method constructs a new Mail::Internet object. Called with no arguments, it creates an empty object, which is ordinarily not particularly useful. Otherwise, it initializes itself from its arguments in much the same way as Mail::Header. The first argument, if provided, may be either a filehandle or an array reference. In the former case, Mail::Internet tries to read the headers and body of the message from the filehandle. If the first argument is an array reference, then the new object initializes itself from the lines of text contained in the array.

@options is a list of named-argument pairs. Several arguments are recognized. Header designates a Mail::Header object to use with the e-mail message. If present, this header is used, ignoring any header information provided in $arg. Similarly, Body points to an array reference containing the lines of the e-mail body. Any body text provided by the $arg input is ignored.

Once the object is created, several methods allow you to examine and modify its contents:

$arrayref = $mail->body

The body() method returns the body of the e-mail message as a reference to an array of lines of text. You may manipulate these lines to modify the body of the message.

$header = $mail->head

The head() method returns the message's Mail::Header object. Modifying this object changes the message header.

$string = $mail->as_string

$string = $mail->as_mbox_string

The as_string() and as_mbox_string() methods both return the message (both header and body) as a single string. The as_mbox_string() function returns the message in a format suitable for appending to UNIX mbox-format mailbox files.

$mail->print([FILEHANDLE])

$mail->print_header([FILEHANDLE])

$mail->print_body([FILEHANDLE])

These three methods print all or part of the message to the designated filehandle or, if not otherwise specified, STDOUT.

Several utility methods perform common transformations on the message's contents:

$mail->add_signature([$file])

$mail->remove_sig([$nlines])

These two methods manipulate the signatures that are often appended to the e-mail messages. The add_signature() function appends the signature contained in $file to the bottom of the e-mail message. If $file is not provided, then the method looks for the file $ENV{HOME}/.signature.

remove_sig() scans the last $nlines of the message body looking for a line consisting of the characters "--", which often sets the body off from the signature. The line and everything below it is removed. If not specified, $nlines defaults to 10.

$reply = $mail->reply

The reply() method creates a new Mail::Internet object with the header initialized to reply to the original message, and the body text indented. This is suitable for autoreply applications.

Finally, the send() method sends the message via the e-mail system:

$result = $mail->send([$method] [,@args])

The send() method converts the message into a string and sends it using Mail::Mailer. The $method and @args arguments select and configure the mailing method. The next section describes the available methods: mail, sendmail, smtp, and test.

If no method is specified, send() chooses a default that should work on your system.

A Mail Autoreply Program

With Mail::Internet, we can easily write a simple autoreply program for received e-mail (Figure 7.4). The autoreply.pl script is similar to the venerable UNIX vacation program. When it receives mail, it checks your home directory for the existence of a file named .vacation. If the file exists, the script replies to the sender using the contents of the file. Otherwise, the program does nothing.

Figure 7.4. An autoreply program


This autoreply script takes advantage of a feature of the UNIX mail system that allows incoming e-mail to be piped to a program. Provided that you're using such a system, you may activate the script by creating a .forward file in your home directory that contains lines like the following:

lstein
 | /usr/local/bin/autoreply.pl
 

Replace the first line with your login name, and the second with the path to the autoreply script. This tells the mail subsystem to place one copy of the incoming mail in the user-specific inbox, and to send another copy to the standard input of the autoreply.pl script.

Let's step through autoreply.pl.

Lines 1-3: Load modules. We turn on strict checking and load the Mail::Internet module.

Lines 4-7: Define constants. One problem with working with programs run by the mailer daemon is that the standard user environment isn't necessarily set up. This means that $ENV{HOME} and other standard environment variables may not exist. Our first action, therefore, is to look up the user's home directory and login name and store them in appropriate constants. Lines 4 and 5 use the getpwuid() function to retrieve this information. We then use the HOME constant to find the locations of the .vacation and .signature files.

Lines 8-9: Create a Mail::Internet object. We check that the .vacation file is present and, if it is not, exit. Otherwise, we create a new Mail::Internet object initialized from the message sent to us on STDIN.

Lines 10-19: Check that the message should be replied to. We shouldn't autoreply to certain messages, such as those sent to us in the Cc: line, or those distributed to a mailing list. Another type of message we should be very careful not to reply to is the bounced message; replying to those has the potential to set up a nasty infinite loop. The next section of the code tries to catch these situations.

We recover the header by calling the Mail::Internet object's head() method, and perform a series of pattern matches on its fields. First we check that our username is mentioned on the To: line. If not, we may be receiving this message as a Cc: or as a member of a mailing list. We next check the Precedence: field. If it's "bulk," then this message is probably part of a mass mailing. If the Subject: line contains the strings "returned mail" or "bounced mail", or if the sender is the mail system itself (identified variously as "mailer daemon," "mail subsystem," or "postmaster"), then we are likely dealing with returned mail and we shouldn't reply or risk setting up a loop. In each of these cases, we just exit normally.

Lines 20-21: Generate reply. To create a new message initialized as a reply to the original, we call the mail message object's reply() method.

Lines 22-26: Prepend vacation message to text. The reply() method will have created body text consisting of the original message quoted and indented. We prepend the contents of the .vacation file to this. We open the .vacation file, call the mail message's body() method to return a reference to the array of body lines, and then use unshift() to insert the contents of .vacation in front of the body. We could replace the body entirely, if we preferred.

Lines 27-28: Add signature. We call the reply's add_signature() method to append the contents of the user's signature file, if any, to the bottom of the message body.

Lines 29-30: Send message. We call the reply's send() method to send the message by the most expedient means.
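The filtering rules described for lines 10-19 are the part of the script most worth getting right, so here is a sketch of them as a standalone subroutine. To keep the example self-contained it takes a plain hash of header fields rather than a Mail::Header object; that layout, and the name should_reply(), are assumptions made for this illustration and are not taken from the original listing.

```perl
use strict;
use warnings;

# Decide whether a message deserves an automatic reply.
# $h maps lower-case field names to their values (hypothetical layout);
# $user is the login name of the person on vacation.
sub should_reply {
    my ($h, $user) = @_;
    my $to      = $h->{to}         // '';
    my $prec    = $h->{precedence} // '';
    my $subject = $h->{subject}    // '';
    my $from    = $h->{from}       // '';

    return 0 unless $to =~ /\b\Q$user\E\b/i;              # Cc'd or list mail
    return 0 if $prec    =~ /bulk/i;                      # mass mailing
    return 0 if $subject =~ /(?:returned|bounced) mail/i; # bounce
    return 0 if $from    =~ /mailer.daemon|mail subsystem|postmaster/i;
    return 1;
}

my %msg = (to      => 'L Stein <lstein@lsjs.org>',
           from    => 'John Doe <doe@acme.org>',
           subject => 'hello there');
print should_reply(\%msg, 'lstein') ? "reply\n" : "ignore\n";
```

With a real Mail::Header object, the hash could be filled by calling get() for each field and removing the trailing newline with chomp().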

Here is an example of a reply issued by the autoreply.pl script in response to the sample message we composed with Net::SMTP in the previous section. The text at the top came from ~/.vacation and the signature at the bottom from ~/.signature. The remainder is quoted from the original message.

To:  John Doe <doe@acme.org>
 From: L Stein <lstein@lsjs.org>
 Subject: Re: hello there
 Date: Fri, 7 Jul 2000 08:12:17 -0400
 Message-Id: <200007071212.IAA12128@pesto>
 
 Hello,
 
 I am on vacation from July 6-July 12, and will not be reading my
 e-mail. I will respond to this message when I return.
 
 Lincoln
 
 John Doe <doe@acme.org> writes:
 > This is just a simple e-mail message.
 > Nothing to get excited about.
 
 > Regards, JD
 --
 
 ======================================================================
 Lincoln D. Stein                  Cold Spring Harbor Laboratory
 ======================================================================
 

If you adapt this autoreply program to your own use, you might want to check the size of the quoted body and delete it if it is unusually large. Otherwise, you might inadvertently echo back a large binary enclosure.

For complex e-mail-processing applications, you should be sure to check out the procmail program, which uses a special-purpose programming language to parse and manipulate e-mail. A number of sophisticated applications have been written on top of procmail, including autoresponders, mailing list generators, and filters for spam mail.

Mail::Mailer

The last component of MailTools that we consider is Mail::Mailer, which is used internally by Mail::Internet to deliver mail. Mail::Mailer provides yet another interface for sending Internet mail. Although it doesn't provide Mail::Internet's header- and body-handling facilities, I find it simpler and more elegant to use in most circumstances.

Unlike Net::SMTP and Mail::Internet, which use object methods to compose and send mail, the Mail::Mailer object acts like a filehandle. This short code fragment shows the idiom:

use Mail::Mailer;
 my $mailer = Mail::Mailer->new;
 $mailer->open( {To       => 'lstein@lsjs.org',
                 From  => 'joe@acme.org',
                 CC   => ['jac@acme.org','vvd@acme.org'],
                 Subject => 'hello there'});
 print $mailer "This is just a simple e-mail message.\n";
 print $mailer "Nothing to get excited about.\n\n";
 print $mailer "Regards, JD\n";
 $mailer->close;
 

After creating the object with new(), we initialize it by calling open() with a hash reference containing the contents of the e-mail header. We then use the mailer object as a filehandle to print several lines of the body text. Then we call the object's close() method to finish processing the message and send it out.

The complete list of Mail::Mailer methods is relatively short.

$mailer = Mail::Mailer->new([$method] [,@args])

The new() method creates a new Mail::Mailer object. The optional $method argument specifies how the mail will be sent out, and @args passes additional arguments to the mailer. Table 7.3 shows the currently recognized mail methods.

The contents of @args depend on the method. In the "mail" and "sendmail" methods, whatever you provide in @args is appended to the command line used to invoke the mail and sendmail programs. For the "smtp" method, you can pass the named argument Server to specify the SMTP server to use. For example:

$mailer = Mail::Mailer->new('smtp',Server => 'mail.lsjs.org')
 

Internally, Mail::Mailer opens up a pipe to the indicated mailer program unless "smtp" is specified, in which case it uses Net::SMTP to send the message. If no method is explicitly provided, then Mail::Mailer scans the command PATH looking for the appropriate executables and chooses the first method it finds, beginning with "mail." The Mail::Mailer documentation describes how you can alter this search order by setting the PERL_MAILERS environment variable.

Table 7.3. Mail::Mailer Mailing Methods

Method    Description
mail      Use the UNIX mail or mailx programs.
sendmail  Use the UNIX sendmail program.
smtp      Use Net::SMTP to send the mail.
test      A debug mode that prints the contents of the message rather than mailing it.

Once created, you initialize the Mail::Mailer object with a set of header fields:

$fh = $mailer->open(\%headers)

The open() method begins a new mail message with the specified headers. For the "mail", "sendmail", and "test" mailing methods, this call forks and execs the mailer program and then returns a pipe opened on the mailer. For the "smtp" method, open() returns a tied filehandle that intercepts calls to print() and passes them to the datasend() method of Net::SMTP. The returned filehandle is identical to the original Mail::Mailer object, so you are free to use it as a Boolean indicating success or failure of the open() call.

The argument to open() is a hash reference whose keys are the fields of the mail header, and whose values can be scalars containing the contents of the corresponding field, or array references containing the values for multivalued fields such as Cc: or To:. This format is compatible with the header_hashref() method of the Mail::Header class. For example:

$mailer->open({To => ['jdoe@acme.org','coyote@acme.org'],
                From => 'lstein@cshl.org'}) or die "can't open: $!";
 

Once the object is initialized, you will print the body of the message to it using it as a filehandle:

print $mailer "This is the first line of the mail message.\n";
 

When the body is done, you should call the object's close() method:

$mailer->close

close() tidies up and sends the message. You should not use the close() Perl built-in for this purpose, because some of the Mail::Mailer methods need to do postprocessing on the message before sending it.

MIME-Tools

Net::SMTP and MailTools provide the basic functionality to create simple text-only e-mail messages. The MIME-Tools package takes this a step further by allowing you to compose multipart messages that contain text and nontext attachments. You can also parse MIME-encoded messages to extract the attachments, add or remove attachments, and resend the modified messages.

A Brief Introduction to MIME

The Multipurpose Internet Mail Extensions, or MIME, are described in detail in RFCs 1521, 2045, 2046, and 2049. Essentially, MIME adds three major extensions to standard Internet mail:

  1. Every message body has a type. In the MIME world, the body of every message has a type that describes its nature; this type is given in the Content-Type: header field. MIME uses a type/subtype nomenclature in which type indicates the category of document, and subtype gives its specific format. Table 7.4 lists some common types and subtypes. The major media categories are "audio," "video," "text," and "image." The "message" category is used for e-mail enclosures, such as when you forward an e-mail onward to someone else, and the "application" category is a hodgepodge of things that could not be classified otherwise. We'll talk about "multipart" momentarily.

  2. Every message body has an encoding. Internet e-mail was originally designed to handle messages consisting entirely of 7-bit ASCII text broken into relatively short lines; some parts of the e-mail system are still limited to this type of message. However, as the Internet became global, it became necessary to accommodate non-English character sets that have 8- or even 16-bit characters. Another problem was binary attachments such as image files, which are not even text-oriented.

    To accommodate the full range of messages that people want to send without rewriting the SMTP protocol and all supporting software, MIME provides several standard encoding algorithms that can encapsulate binary data in a text form that conventional mailers can handle. Each header has a Content-Transfer-Encoding: field that describes the message body's encoding. Table 7.5 lists the five standard encodings.

    If you are dealing with 8-bit data, only the quoted-printable and base64 encodings are guaranteed to make it through e-mail gateways.

  3. Any message may have multiple parts. The multipart/* MIME types designate messages that have multiple parts. Each part has its own content type and MIME headers. It's even possible for a part to have its own subparts. The multipart/alternative MIME type is used when the various subparts correspond to the same document repeated in different formats. For example, some browser-based mailers send their messages in both text-only and HTML form. multipart/mixed is used when the parts are not directly related to each other, for example an e-mail message and a JPEG enclosure.
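To make the type/subtype nomenclature concrete, here is a small sketch that takes a Content-Type: value apart using nothing but core Perl. The subroutine name parse_content_type() and the sample boundary string are invented for this illustration, and the pattern handles only the common cases; for real work the MIME-Tools classes are the better choice.

```perl
use strict;
use warnings;

# Split a Content-Type value into type, subtype, and a hash of parameters.
# Returns an empty list if the value does not look like type/subtype.
sub parse_content_type {
    my $value = shift;
    my ($type, $subtype, $rest) =
        $value =~ m{^\s*([\w.+-]+)/([\w.+-]+)\s*(.*)$}s or return;

    my %params;
    # Parameters look like ; name=value or ; name="quoted value"
    while ($rest =~ /;\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;\s]+))/g) {
        $params{lc $1} = defined $2 ? $2 : $3;
    }
    return (lc $type, lc $subtype, \%params);
}

my ($type, $subtype, $params) =
    parse_content_type('multipart/mixed; boundary="----=_sample_boundary"');
print "$type / $subtype, parts separated by $params->{boundary}\n";
```

The boundary parameter shown here is what a multipart message uses to mark where one part ends and the next begins.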

Table 7.4. Common MIME Types
Type Description
audio/* A sound
audio/basic Sun Microsystems' audio "au" format
audio/mpeg An MP3 file
audio/midi A MIDI file
audio/x-aiff AIFF sound format
audio/x-wav Microsoft's "wav" format
image/* An image
image/gif CompuServe GIF format
image/jpeg JPEG format
image/png Portable network graphics format
image/tiff TIFF format
message/* An e-mail message
message/news Usenet news message format
message/rfc822 Internet e-mail message format
multipart/* A message containing multiple parts
multipart/alternative The same information in alternative forms
multipart/mixed Unrelated pieces of information mixed together
text/* Human-readable text
text/html Hypertext Markup Language
text/plain Plain text
text/richtext Enriched text in RFC 1523 format
text/tab-separated-values Tables
video/* Moving video or animation
video/mpeg MPEG movie format
video/quicktime Quicktime movie format
video/msvideo Microsoft "avi" movie format
application/* None of the above
application/msword Microsoft Word Format
application/news-message-id News posting format
application/octet-stream A raw binary stream
application/postscript PostScript
application/rtf Microsoft rich text format
application/wordperfect5.1 Word Perfect 5.1 format
application/gzip Gzip file compression format
application/zip PKZip file compression format

Table 7.5. MIME Encodings
Encoding Description
7bit The body is not actually encoded. This value simply asserts that text is 7-bit ASCII, with no line longer than 1,000 characters.
8bit The body is not actually encoded. This value asserts that the text may contain 8-bit characters, but has no line longer than 1,000 characters.
binary The body is not actually encoded. This value asserts that the text may contain 8-bit characters and may have lines longer than 1,000 characters.
quoted-printable This encoding is used for text-oriented messages that may contain 8-bit characters (such as messages in non-English character sets). All 8-bit characters are encoded into 7-bit escape sequences, and long lines are folded so that no encoded line exceeds 76 characters.
base64 This encoding is used for arbitrary binary data such as audio and images. Every group of three 8-bit bytes is encoded as four characters drawn from a 64-character subset of ASCII, an approach similar in spirit to uuencode. The resulting text is then folded into lines of at most 76 characters.
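
To make the last two rows of Table 7.5 concrete, here is a minimal sketch that uses the MIME::Base64 and MIME::QuotedPrint modules bundled with recent Perl distributions. These modules are not covered in the text above, and the sample strings are made up for illustration.

```perl
use strict;
use warnings;
use MIME::Base64      qw(encode_base64 decode_base64);
use MIME::QuotedPrint qw(encode_qp decode_qp);

# Some 8-bit data: the bytes 0xE9 and 0xFF are outside 7-bit ASCII.
my $binary = "caf\xE9 \xFF\x00\x01";

# base64 turns every 3 input bytes into 4 printable ASCII characters.
my $b64 = encode_base64($binary);
print "base64:           $b64";

# quoted-printable leaves plain ASCII alone and escapes other bytes as =XX.
my $qp = encode_qp("caf\xE9\n");
print "quoted-printable: $qp";

# Both encodings are reversible.
print "round trip ok\n"
    if decode_base64($b64) eq $binary && decode_qp($qp) eq "caf\xE9\n";
```

In practice you rarely call these functions yourself; the MIME::* modules described below apply the encoding named in each part's Content-Transfer-Encoding: field.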

Any part of a multipart MIME message may contain a Content-Disposition: header, which is a hint to the mail reader as to how to handle the part. Possible dispositions include attachment, which tells the reader to treat the part's body as an enclosure to be saved to disk, and inline, which tells the reader to try to display the part as a component of the document. For example, a mail reader application may be able to display an inline image in the same window as the textual part of the message. The Content-Disposition: field can also suggest a filename to store attachments under. Another field, Content-Description:, provides an optional human-readable description of the part.

Notice that an e-mail message with a JPEG attachment is really a multipart MIME message containing two parts, one for the text of the message and the other for the JPEG image.

Without going into the format of a MIME message in detail, Figure 7.5 shows a sample multipart message to give you a feel for the way they work. This message has four parts: a 7-bit text message that appears at the top of the message, a base64-encoded audio file that uses the Microsoft WAV format, a base64-encoded JPEG file, and a final 7-bit part that contains some parting words and the e-mail signature. (The binary enclosures have been truncated to save space.)

Figure 7.5. A sample multipart MIME message

Notice that each part of the message has its own header and body, and that the parts are delimited by a short unique boundary string beginning with a pair of hyphens. The message as a whole has its own header, which is a superset of the RFC 822 Internet mail header, and includes a Content-Type: field of multipart/mixed.

This is pretty much all you need to know about MIME. The MIME modules will do all the rest of the work for you.

Organization of the MIME::* Modules

MIME-Tools has four major parts.

MIME::Entity

MIME::Entity is a MIME message. It contains a MIME::Head (the message header) and a MIME::Body (the message body). In multipart messages, the body may contain other MIME::Entities, and any of these may contain their own MIME::Entities, ad infinitum.

Among other things, MIME::Entity has methods for turning the message into a text string and for mailing the message.



MIME::Head

MIME::Head is the header part of a MIME message. It has methods for getting and setting the various fields.



MIME::Body

MIME::Body represents the body part of a message. Because MIME bodies can get quite large (e.g., audio files), MIME::Body has methods for storing data to disk and reading and writing it in a filehandle-like fashion.



MIME::Parser

The MIME::Parser recursively parses a MIME-encoded message from a file, a filehandle, or in-memory data, and returns a MIME::Entity. You can then extract the parts, or modify and remail the message.



Figure 7.6 is a short example of using MIME::Entity to build a simple message that consists of a text greeting and an audio enclosure.

Figure 7.6. Sending an audio attachment with MIME tools

Lines 1-3: Load modules. We turn on strict type checking and load the MIME::Entity module. It brings in the other modules it needs, including MIME::Head and MIME::Body.

Lines 4-8: Create top-level MIME::Entity. Using the MIME::Entity->build() method, we create a "top-level" multipart MIME message that contains the two subparts. The arguments to build() include the From: and To: fields, the Subject: line, and a MIME Type of multipart/mixed. This returns a MIME::Entity object.

Lines 9-18: Attach the text of the message. We create the text of the message and store it in a scalar variable. Then, using the top-level MIME entity's attach() method, we incorporate the text data into the growing multipart message, specifying a MIME Type of text/plain, an Encoding of 7bit, and the message text as the Data.

Lines 19-23: Attach the audio file. We again call attach(), but this time specify a Type of audio/wav and an Encoding of base64. We don't want to read the whole audio file into memory, so we use the Path argument to direct MIME::Entity to the file where the audio data can be found. The Description argument adds a human-readable description of the attachment to the outgoing message.

Lines 24-25: Sign the message. We call the MIME entity object's sign() utility to append our signature file to the text of the message.

Lines 26-27: Send the message. We call the send() method to format and mail the completed message using the smtp method.

That's all there is to it. In the next sections we will look at the MIME modules more closely.
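
The steps just described can be sketched roughly as follows. This is not the original listing from Figure 7.6: the addresses, the message text, and the stand-in audio file are all invented for illustration, and the script prints the finished message rather than mailing it so that it can be run without a mail server.

```perl
use strict;
use warnings;
use MIME::Entity;
use File::Temp qw(tempfile);

# Stand-in for a real audio file so the sketch is self-contained.
my ($fh, $wav_path) = tempfile(SUFFIX => '.wav', UNLINK => 1);
binmode $fh;
print $fh "RIFF\x00\x00\x00\x00WAVE";   # placeholder bytes, not playable audio
close $fh;

# Top-level multipart/mixed container carrying the RFC 822 fields.
my $msg = MIME::Entity->build(
    From    => 'sender@example.com',
    To      => 'recipient@example.com',
    Subject => 'A greeting with sound',
    Type    => 'multipart/mixed',
);

# First part: the text greeting.
$msg->attach(
    Type     => 'text/plain',
    Encoding => '7bit',
    Data     => "Hello,\n\nThe enclosed sound file says it all.\n",
);

# Second part: the audio enclosure, read from disk when the message is output.
$msg->attach(
    Type        => 'audio/wav',
    Encoding    => 'base64',
    Path        => $wav_path,
    Description => 'A short greeting',
);

# Append a signature to the text part.
$msg->sign(Signature => "Your friendly sender\n");

# Print instead of mailing; a real script would call $msg->send('smtp') here.
my $text = $msg->as_string;
print $text;
```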

MIME::Entity

MIME::Entity is a subclass of Mail::Internet and, like it, represents an entire e-mail message. However, there are some important differences between Mail::Internet and MIME::Entity. Whereas Mail::Internet contains just a single header and body, the body of a MIME::Entity can be composed of multiple parts, each of which may be composed of subparts. Each part and subpart is itself a MIME::Entity (Figure 7.7). Because of these differences, MIME::Entity adds several methods for manipulating the message's body in an object-oriented fashion.

Figure 7.7. A MIME message can contain an unlimited number of nested attachments

This summary omits some obscure methods. See the MIME::Entity POD documentation for the full details.

New entities are created with build():

$entity = MIME::Entity->build(arg1 => $val1, arg2 => $val2, ...)

The build() method is the main constructor for MIME::Entity. It takes a series of named arguments and returns an initialized MIME::Entity object. The following arguments are the most common.

Field name. Any of the RFC 822 or MIME-specific fields can be used as arguments, and the provided value will be incorporated into the message header. As in Mail::Header, you can use an array reference to pass a multivalued field. You should probably confine yourself to using RFC 822 fields, such as From: and To:, because any MIME fields that you provide will override those generated by MIME::Entity.

Data. For single-part entities only, the data to use as the message body. This can be a scalar or an array reference containing lines to be joined to form the body.

Path. For single-part entities only, the path to a file where the data for the body can be found. This can be used to attach to the outgoing message a file that is larger than you could store in main memory.

Boundary. The boundary string to place between parts of a multipart message. MIME::Entity will choose a good default for you; ordinarily you won't want to use this argument.

Description. A human-readable description of the body used as the value of the Content-Description: field.

Disposition. This argument becomes the value of the header's Content-Disposition: field. It may be either attachment or inline, defaulting to inline if the argument is not specified.

Encoding. The value of this argument becomes the Content-Transfer-Encoding: field. You should provide one of 7bit, 8bit, binary, quoted-printable, or base64. Include this argument even if you are sending a simple text message because, if you don't, MIME::Entity defaults to binary. You may also provide a special value of -SUGGEST to have MIME::Entity make a guess based on a byte-by-byte inspection of the entire body.

Filename. The recommended filename for the mail reader to use when saving this entity to disk. If not provided, the recommended filename will be derived from the value of Path.

Type. The MIME type of the entity, text/plain by default. MIME::Entity makes no attempt to guess the MIME type from the file name indicated by the Path argument or from the contents of the Data argument.

Here's the idiom for creating a single-part entity (which may later be attached to a multipart entity):

$part = MIME::Entity->build(To       => 'jdoe@acme.org',
                             Type     => 'image/jpeg',
                             Encoding => 'base64',
                             Path     => '/tmp/pictures/oranges.jpg');
 

And here's the idiom for creating a multipart entity, to which subparts will be added:

$multipart = MIME::Entity->build(To    => 'jdoe@acme.org',
                                  Type  => 'multipart/mixed');
 

Notice that single-part entities should have a body specified using either the Data or the Path arguments. Multipart entities should not.

Once the MIME::Entity is created, you will attach new components to it using add_part() or attach():

$part = $entity->add_part($part [,$offset])

The add_part() method adds a subpart to the multipart MIME::Entity contained in $entity. The $part argument must be a MIME::Entity object. Each multipart MIME::Entity object maintains an array of its subparts, and by default, the new part is appended to the end of the current array. You can modify this by providing an offset argument. The method returns the newly added part.

If you attempt to add a part to a single-part entity, MIME::Entity automagically converts the entity into type multipart/mixed, and reattaches the original contents as a subpart. The entity you are adding then becomes the second subpart on the list. This feature allows you to begin to compose a single-part message and later add attachments without having to start anew.

$part = $entity->attach(arg1 => $val1, arg2 => $val2, ...)

The attach() method is a convenience function that first creates a new MIME::Entity object using build(), and then calls $entity->add_part() to insert the newly created part into the message. The arguments are identical to those of build(). If successful, the method returns the new MIME::Entity.
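
As a small illustration of composing a single-part message first and adding an enclosure later, the following sketch starts with a plain text entity and then calls attach() on it. The subject line, file name, and body text are hypothetical.

```perl
use strict;
use warnings;
use MIME::Entity;

# Start with an ordinary single-part text message.
my $msg = MIME::Entity->build(
    To       => 'jdoe@acme.org',
    Subject  => 'Notes',
    Type     => 'text/plain',
    Encoding => '7bit',
    Data     => "Here are the notes.\n",
);
print "before: ", $msg->mime_type, "\n";

# Attaching a second body turns the message into a multipart container;
# the original text becomes its first subpart.
$msg->attach(
    Type        => 'text/plain',
    Encoding    => '7bit',
    Disposition => 'attachment',
    Filename    => 'notes.txt',
    Data        => "line one\nline two\n",
);
print "after:  ", $msg->mime_type, " with ", scalar($msg->parts), " parts\n";
```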

Several methods provide access to the contents of the entity:

$head = $entity->head([$newhead])

The head() method returns the MIME::Head object associated with the entity. You can then call methods in the head object to examine and change fields. The optional $newhead argument, if provided, can be used to replace the header with a different MIME::Head object.

$body = $entity->bodyhandle([$newbody])

The bodyhandle() method gets or sets the MIME::Body object associated with the entity. You can then use this object to retrieve or modify the unencoded contents of the body. The optional $newbody argument can be used to replace the body with a different MIME::Body object. Don't confuse this method with body(), which returns an array ref containing the text representation of the encoded body.

If the entity is multipart, then there will be no body, in which case bodyhandle() returns undef. Before trying to fetch the body, you can use the is_multipart() method to check for this possibility.

$pseudohandle = $entity->open($mode)

The open() method opens the body of the entity for reading or writing, and returns a MIME pseudohandle. As described later in the section on the MIME::Body class, MIME pseudohandles have object methods similar to those in the IO::Handle class (e.g., read(), getline(), and print()), but they are not handles in the true sense of the word. The pseudohandle can be used to retrieve or change the contents of the entity's body.

$mode is one of "r" for reading, or "w" for writing.
 

@parts = $entity->parts

$part = $entity->parts($index)

@parts = $entity->parts(\@parts)

The parts() method returns the list of MIME::Entity parts in a multipart entity. If called with no arguments, the method returns the entire list of parts; if called with an integer index, it returns the designated part. If passed the reference to an array of parts, the method replaces the current parts with the contents of the array. This allows you to delete parts or rearrange their order.

For example, this code fragment reverses the order of the parts in the entity:

$entity->parts([reverse $entity->parts])
 

If the entity is not multipart, parts() returns an empty list.
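
Because parts can themselves be multipart, code that inspects a message usually recurses. The sketch below builds a small nested message and prints an indented outline of its MIME types; the outline() helper and the message contents are hypothetical, not part of MIME-Tools.

```perl
use strict;
use warnings;
use MIME::Entity;

# Return one line per entity, indenting nested parts by depth.
sub outline {
    my ($entity, $depth) = @_;
    my @lines = (('  ' x $depth) . $entity->mime_type);
    if ($entity->is_multipart) {
        push @lines, outline($_, $depth + 1) for $entity->parts;
    }
    return @lines;
}

# A multipart/alternative block holding the same greeting in two formats.
my $inner = MIME::Entity->build(Type => 'multipart/alternative');
$inner->attach(Type => 'text/plain', Encoding => '7bit', Data => "Hi\n");
$inner->attach(Type => 'text/html',  Encoding => '7bit', Data => "<p>Hi</p>\n");

# A top-level multipart/mixed message containing that block plus a note.
my $top = MIME::Entity->build(To => 'jdoe@acme.org', Type => 'multipart/mixed');
$top->add_part($inner);
$top->attach(Type => 'text/plain', Encoding => '7bit', Data => "Bye\n");

print "$_\n" for outline($top, 0);
```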

A variety of methods return information about the Entity:

$type = $entity->mime_type

$type = $entity->effective_type

The mime_type() and effective_type() methods both return the MIME type of the entity's body. Although the two methods usually return the same value, there are some error conditions in which MIME::Parser cannot decode the entity and is therefore unable to return the body in its native form. In this case, mime_type() returns the type that the body is supposed to be, and effective_type() returns the type that is actually returned when you retrieve or save the body data (most probably application/octet-stream). To be safe, use effective_type() when retrieving the body of an entity created by MIME::Parser. For entities you created yourself with MIME::Entity->build(), there's no difference.

$boolean = $entity->is_multipart

The is_multipart() method is a convenience routine that returns true if the entity is multipart, false if it contains a single part only.

$entity->sign(arg1 => $val1, arg2=> $val2, ...)

The sign() method attaches a signature to the message. If the message contains multiple parts, MIME::Entity searches for the first text entity and attaches the signature to that.

The method adds some improvements to the version implemented in Mail::Internet. However, you must provide at least one of the following named arguments:

File. This argument allows you to use the signature text contained in a file. Its value should be the path to a local file.

Signature. This argument uses the indicated text as the signature. Its value can be a scalar or a reference to an array of lines.

Force. Sign the entity even if its content type isn't text/*. The value is treated as a Boolean.

Remove. Call remove_sig() to scan for an existing signature and remove it before adding the new signature. The value of this argument is passed to remove_sig(). Provide 0 to disable signature removal entirely.

For example, here's how to add a signature using a scalar value:

$entity->sign(Signature => "That's all folks!");
 

$entity->remove_sig([$nlines])

Remove_sig() scans the last $nlines of the message body as it looks for a line consisting of the characters "--". The line and everything below it is removed. $nlines defaults to 10.

$entity->dump_skeleton([FILEHANDLE])

Dump_skeleton() is a debugging utility. It dumps a text representation of the structure of the entity and its subparts to the indicated filehandle, or, if no filehandle is provided, to standard output.

Finally, several methods are involved in exporting the entity as text and mailing it:

$entity->print([FILEHANDLE])

$entity->print_header([FILEHANDLE])

$entity->print_body([FILEHANDLE])

These three methods, inherited from Mail::Internet, print the encoded text representations of the whole message, the header, or the body, respectively. The parts of a multipart entity are also printed. If no filehandle is provided, it prints to STDOUT.

$arrayref = $entity->header

The header() method, which is inherited from Mail::Internet, returns the text representation of the header as a reference to an array of lines. Don't confuse this with the head() method, which returns a MIME::Head object.

$arrayref = $entity->body

This method, which is inherited from Mail::Internet, returns the body of the message as a reference to an array of lines. The lines are encoded in a form suitable for passing to a mailer. Don't confuse this method with bodyhandle() (discussed earlier), which returns a MIME::Body object.

$string = $entity->as_string

$string = $entity->stringify_body

$string = $entity->stringify_header

The as_string() method converts the message into a string, encoding any parts that need to be. The stringify_body() and stringify_header() methods respectively operate on the body and header only.

$result = $entity->send([$method])

The send() method, which is inherited from Mail::Internet, sends off the message using the selected method. I have noticed that some versions of the UNIX mail program have problems with MIME headers, and so it's best to set $method explicitly to either "sendmail" or "smtp".

$entity->purge

If you have received the MIME::Entity object from MIME::Parser, it is likely that the body of the entity or one of its subparts is stored in a temporary file on disk. After you are finished using the object, you should call purge() to remove these temporary files, reclaiming the disk space. This does not happen automatically when the object is destroyed.

MIME::Head

The MIME::Head class contains information about a MIME entity's header. It is returned by the MIME::Entity head() method.

MIME::Head is a subclass of Mail::Header and inherits most of its methods from there. It is a historical oddity that one module is called "Head" and the other "Header." MIME::Head adds a few utility methods to Mail::Header, the most useful of which are read() and from_file():

$head = MIME::Head->read(FILEHANDLE)

In addition to creating a MIME::Head object manually by calling add() for each header field, you can create a fully initialized header from an open filehandle by calling the read() method. This supplements Mail::Header's read() method, which allows you to read a file only into a previously created object.

$head = MIME::Head->from_file($file)

The from_file() constructor creates a MIME::Head object from the indicated file by opening it and passing the resulting filehandle to read().

All other functions behave as they do in Mail::Header. For example, here is one way to retrieve and change the subject line in a MIME::Entity object:

$old_subject = $entity->head->get('Subject');
chomp $old_subject;   # get() returns the value with its trailing newline
$new_subject = "Re: $old_subject";
$entity->head->replace(Subject => $new_subject);
 

As in Mail::Header, the get() method of MIME::Head returns field values with their trailing newlines still attached.

MIME::Body

The MIME::Body class contains information on the body part of a MIME::Entity. MIME::Body objects are returned by the MIME::Entity bodyhandle() method, and are created as needed by the MIME::Entity build() and attach() methods. You will need to interact with MIME::Body objects when parsing incoming MIME-encoded messages.

Because MIME-encoded data can be quite large, an important feature of MIME::Body is its ability to store the data on disk or in memory ("in core" as the MIME-Tools documentation calls it). The methods available in MIME::Body allow you to control where the body data is stored, to read and write it, and to create new MIME::Body objects.

MIME::Body has three subclasses, each specialized for storing data in a different manner:

MIME::Body::File: This subclass stores its body data in a disk file. This is suitable for large binary objects that wouldn't easily fit into main memory.

MIME::Body::Scalar: This subclass stores its body data in a scalar variable in main memory. It's suitable for small pieces of data such as the text part of an e-mail message.

MIME::Body::InCore: This subclass stores its body data in an array reference kept in main memory. It's suitable for larger amounts of text on which you will perform multiple reads or writes.

Normally MIME::Parser creates MIME::Body::File objects to store body data on disk while it is parsing.

$body = MIME::Body::File->new($path)

To create a new MIME::Body object that stores its data to a file, call the MIME::Body::File->new() method with the path to the file. The file doesn't have to exist, but will be created when you open the body for writing.

$body = MIME::Body::Scalar->new(\$string)

The MIME::Body::Scalar->new() method returns a body object that stores its data in a scalar reference.

$body = MIME::Body::InCore->new($string)

$body = MIME::Body::InCore->new(\$string)

$body = MIME::Body::InCore->new(\@string)

The MIME::Body::InCore class has the most flexible constructor. Internally it stores its data in an array reference, but it can be initialized from a scalar, a reference to a scalar, or a reference to an array.

Once you have a MIME::Body object, you can access its contents by opening it with the open() method.

$pseudohandle = $body->open($mode)

This method takes a single argument that indicates whether to open the body for reading ("r") or writing ("w"). The returned object is a pseudohandle that implements the IO::Handle methods read(), print(), and getline(). However, it is not a true filehandle, so be careful not to pass the returned pseudohandle to any of the built-in procedures such as <> or read().

The following code fragment illustrates how to read the contents of a large MIME::Body stored in a MIME::Entity object and print it to STDOUT. The contents recovered in this way are in their native form, free of any MIME encoding:

$body = $entity->bodyhandle or die "no body";
$handle = $body->open("r");
print $data while $handle->read($data,1024);
 

For line-oriented data, we would have used the getline() method instead.

Another code fragment illustrates how to write a MIME::Body's contents using its print() method. If the body is attached to a file, the data is written there. Otherwise, it is written to an in-memory data structure:

$body = $entity->bodyhandle or die "no body";
$handle = $body->open("w");
$handle->print($_) while <>;
 

MIME::Body provides a number of convenience methods:

@lines = $body->as_lines

$string = $body->as_string

as_lines() and as_string() are convenience functions that return the entire contents of the body in a single operation. as_lines() opens the body and calls getline() repeatedly, returning an array of newline-terminated lines. as_string() reads the entire body into a scalar. Because either method can read a large amount of data into memory, you should exercise some caution before calling them.
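
The following sketch ties the constructor, the pseudohandle, and the convenience methods together using an in-memory body, so it does not touch the disk. The sample lines are invented for illustration.

```perl
use strict;
use warnings;
use MIME::Body;

# Initialize an in-memory body from an array of lines.
my $body = MIME::Body::InCore->new(["first line\n", "second line\n"]);

# Read it back one line at a time through a pseudohandle.
my $in = $body->open("r") or die "cannot open body for reading";
my @seen;
while (defined(my $line = $in->getline)) {
    push @seen, $line;
}
$in->close;
print "read ", scalar(@seen), " lines\n";

# Opening for writing replaces the old contents.
my $out = $body->open("w") or die "cannot open body for writing";
$out->print("replacement text\n");
$out->close;

# as_string() fetches the whole body in one call.
print $body->as_string;
```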

$path = $body->path([$newpath])

If the body object is attached to a file, as in MIME::Body::File, then path() returns the path to the file or sets it if the optional $newpath argument is provided. If the data is kept in memory, then path() returns undef.

$body->print([FILEHANDLE])

The print() method prints the unencoded body to the indicated filehandle, or, if none is provided, to the currently selected filehandle. Do not confuse this with the print() method provided by the pseudohandles returned by the open() method, which is used to write data into the body object.

$body->purge

Purge unlinks the file associated with the body object, if any. It is not called automatically when the object is destroyed.

MIME::Parser

The last major component of MIME-Tools is the MIME::Parser class, which parses the text representation of a MIME message into its various components. The class is simple enough to use, but has a large number of options that control various aspects of its operation. The short example in Figure 7.8 will give you the general idea.

Figure 7.8. Using MIME::Parser

Lines 1-3: Load modules. We turn on strict type checking and load the MIME::Parser module. It brings in the other modules it needs, including MIME::Entity.

Lines 4-5: Open a message. We recover the name of a file from the command line, which contains a MIME-encoded message, and open it. This filehandle will be passed to the parser later.

Lines 6-8: Create and configure the parser. We create a new parser object by calling MIME::Parser->new(). We then call the newly created object's output_dir() method to set the directory where the parser will write the body data of extracted enclosures.

Lines 9-10: Parse the file. We pass the open filehandle to the parser's parse() method. The value returned from the method is a MIME::Entity object corresponding to the top level of the message.

Lines 11-14: Print information about the top-level entity. To demonstrate that we parsed the message, we recover and print the From: and Subject: lines of the header, calling the entity's head() method to get the MIME::Head object each time. We also print the MIME type of the whole message, and the number of subparts, which we derive from the entity's parts() method.

Lines 15-17: Print information about the parts. We loop through each part of the message. For each, we call its mime_type() method to retrieve the MIME type, and the path() method of the corresponding MIME::Body to get the name of the file that contains the data.

Line 18: Clean up When we are finished, we call purge() to remove all the parsed body data files.

When I ran the program on a MIME message stored in the file mime.test, this was the result:

% simple_parse.pl ~/mime.test
 From       = Lincoln Stein <lstein@cshl.org>
 Subject    = testing mime parser
 MIME type  = multipart/mixed
 Parts      = 5
         text/plain   /tmp/msg-1857-1.dat
         audio/wav    /tmp/assimilated.wav
         image/jpeg   /tmp/aw-2-19.jpg
         audio/mpeg   /tmp/NorthwestPassage.mp3
         text/plain   /tmp/msg-1857-2.dat
 

This multipart message contains five parts. The first and last parts contain text data and correspond to the salutation and the signature. The remaining parts are enclosures, consisting of an audio/wav sound file, a JPEG image, and a ripped MP3 track.
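
For readers who want to try the same sequence of calls without a saved message on hand, here is a rough sketch in the spirit of the script described above. It is not the listing from Figure 7.8: the two-part message is inlined and invented, parse_data() stands in for opening a file, and a temporary directory is assumed for the extracted parts.

```perl
use strict;
use warnings;
use MIME::Parser;
use File::Temp qw(tempdir);

# A tiny two-part message, inlined so the sketch is self-contained.
my $message = <<'END';
From: Some Sender <sender@example.com>
To: recipient@example.com
Subject: testing mime parser
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Hello there.
--XYZ
Content-Type: application/octet-stream
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="hello.bin"

aGVsbG8=
--XYZ--
END

my $parser = MIME::Parser->new;
$parser->output_dir(tempdir(CLEANUP => 1));   # extracted bodies land here

my $entity = $parser->parse_data($message);

# Header values come back with their trailing newline, so none is added here.
print "From      = ", $entity->head->get('From');
print "Subject   = ", $entity->head->get('Subject');
print "MIME type = ", $entity->mime_type, "\n";
print "Parts     = ", scalar($entity->parts), "\n";

for my $part ($entity->parts) {
    print "\t", $part->mime_type, "\t", $part->bodyhandle->path, "\n";
}

$entity->purge;   # delete the extracted files
```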

We will walk through a more complex example of MIME::Parser in Chapter 8, where we deal with writing Post Office Protocol clients. The example developed there will spawn external viewers to view image and audio attachments.

Because MIME files can be quite large, MIME::Parser's default is to store the parsed MIME::Body parts as files using the MIME::Body::File class. You can control where these files are stored using either the output_dir() or the output_under() methods. The output_dir() method tells MIME::Parser to store the parts directly inside a designated directory. output_under(), on the other hand, creates a two-tier directory. For each parsed e-mail message, MIME::Parser creates a subdirectory under the base directory specified by output_under(), and then writes the MIME::Body::File data there.

In either case, all the temporary files are cleared when you call the top-level MIME::Entity's purge() method. You can instead keep some or all of the parts. To keep some parts, step through the message parts and call purge() selectively on those that you don't want to keep. You can either leave the other parts where they are or move them to a different location for safekeeping. To keep all parsed parts, don't call purge() at all.

Parsing is complex, and the parse() method may die if it encounters any of a number of exceptions. You can catch such exceptions and attempt to perform some error recovery by wrapping the call to parse() in an eval{} block:

$entity = eval { $parser->parse(\*F) };
 warn $@ if $@;
 

Here is a brief list of the major functions in MIME::Parser, starting with the constructor.

$parser = MIME::Parser->new

The new() method creates a new parser object with default settings. It takes no arguments.

$dir = $parser->output_dir

$previous = $parser->output_dir($newdir)

The output_dir() method gets or sets the output directory for the parse. This is the directory in which the various parts and enclosures of the parsed message are (temporarily) stored.

If called with no arguments, it returns the current value of the output directory. If called with a directory path, it sets the output directory and returns its previous value. The default setting is ".", the current directory.

$dir = $parser->output_under

$parser->output_under($basedir [,DirName=>$dir [,Purge=>$purge]])

output_under() changes the temporary file strategy to use a two-tier directory. MIME::Parser creates a subdirectory inside the specified base directory and then places the parsed MIME::Body::File data in the newly created subdirectory.

In addition to $basedir, output_under() accepts two optional named arguments:

DirName. By default, the subdirectory is named by concatenating the current time, process ID, and a sequence number. If you would like a more predictable directory name, you can use DirName to provide a subdirectory name explicitly.

Purge. If you use the same subdirectory name each time you run the program, you might want to set Purge to a true value, in which case output_under() will remove anything in the subdirectory before beginning the parse.

Called with no arguments, output_under() returns the current base directory name. Here are two examples:

# store enclosures in ~/mime_enclosures
 $parser->output_under("$ENV{HOME}/mime_enclosures");
 
 # store enclosures under /tmp in subdirectory "my_mime"
 $parser->output_under("/tmp", DirName=>'my_mime', Purge=>1);
 

The main methods are parse(), parse_data(), and parse_open():

$entity = $parser->parse(\*FILEHANDLE)

The parse() method parses a MIME message by reading its text from an open filehandle. If successful, it returns a MIME::Entity object. Otherwise, parse() can throw any number of run-time exceptions. To catch those exceptions, wrap parse() in an eval{} block as described earlier.

$entity = $parser->parse_data($data)

The parse_data() method parses a MIME message that is contained in memory. $data can be a scalar holding the text of the message, a reference to a scalar, or a reference to an array of scalars. The latter is intended to be used on an array of the message's lines, but can be any array which, when concatenated, yields the text of the message. If successful, parse_data() returns a MIME::Entity object. Otherwise, like parse(), it can throw any of a number of run-time exceptions.

$entity = $parser->parse_open($file)

The parse_open() method is a convenience function. It opens the file provided, and then passes the resulting filehandle to parse(). It is equivalent to:

open (F,$file);
 $entity = $parser->parse(\*F);
 

Because parse_open() uses Perl's open() function, you can play the usual tricks with pipes. For example:

$entity = $parser->parse_open("zcat ./mailbox.gz |");
 

This uncompresses the compressed mailbox using the zcat program and pipes the result to parse().

Several other methods control the way the parse operates:

$flag = $parser->output_to_core

$parser->output_to_core($flag)

The output_to_core() method controls whether MIME::Parser creates files to hold the decoded body data of MIME::Entity parts, or attempts to keep the data in memory. If $flag is false (the default), then the parts are parsed into disk files. If $flag is true, then MIME::Parser stores the body parts in main memory as MIME::Body::InCore objects.

Since enclosures can be quite large, you should be cautious about doing this. With no arguments, this method returns the current setting of the flag.

$flag = $parser->ignore_errors

$parser->ignore_errors($flag)

The ignore_errors() method controls whether MIME::Parser tolerates certain syntax errors in the MIME message during parsing. If true (the default), then errors generate warnings, but if not, they cause a fatal exception during parse().

$error = $parser->last_error

$head = $parser->last_head

These two methods are useful for dealing with unparseable MIME messages. last_error() returns the last error message generated during the most recent parse. It is set when an error was encountered, and either ignore_errors() is true, or the call to parse() was wrapped in an eval{}.

last_head() returns the top-level MIME::Head object from the last stream we attempted to parse. Even though the body of the message wasn't successfully parsed, we can use the header returned by this method to salvage some information, such as the subject line and the name of the sender.

MIME Example: Mailing Recent CPAN Entries

In this section, we develop an application that combines the Net::FTP module from Chapter 6 with the Mail and MIME modules from this chapter. The program will log into the CPAN FTP site at ftp.perl.org, read the RECENT file that contains a list of modules and packages recently contributed to the site, download them, and incorporate them as attachments into an outgoing e-mail message. The idea is to run the script at weekly intervals to get automatic notification of new CPAN uploads.

Figure 7.9 shows the listing for the application, called mail_recent.pl.

Figure 7.9. The mail_recent.pl script

Lines 1-4: Load modules. We turn on strict syntax checking and load the Net::FTP and MIME::Entity modules.

Lines 5-9: Define constants. We set constants corresponding to the FTP site to connect to, the CPAN directory, and the name of the RECENT file itself. We also declare a constant with the e-mail address of the recipient of the message (in this case, my local username), and a DEBUG constant to turn on verbose progress messages.

Lines 10-11: Declare globals. The %RETRIEVE global contains the list of files to retrieve from CPAN. $TMPDIR contains the path of a directory in which to store the downloaded files temporarily before mailing them. This is derived from the TMPDIR environment variable, or, if not otherwise specified, from /usr/tmp. Windows and Macintosh users have to check and modify this for their systems.

Lines 12-15: Log into CPAN and fetch the RECENT file. We create a new Net::FTP object and log into the CPAN mirror. If successful, we change to the directory that contains the archive and call the FTP object's retr() method to return a filehandle from which we can read the RECENT file.

Lines 17-23: Parse the RECENT file. RECENT contains a list of all files on the CPAN archive that are new or have changed recently, but we don't want to download them all. The files we're interested in have lines that look like this:

modules/by-module/Apache/Apache-Filter-1.011.tar.gz
 modules/by-module/Apache/Apache-iNcom-0.09.tar.gz
 modules/by-module/Audio/Audio-Play-MPG123-0.04.tar.gz
 modules/by-module/Bundle/Bundle-WWW-Search-ALL-1.09.tar.gz
 

We open the file for reading and scan through it one line at a time, looking for lines that match the appropriate pattern. We store the filename and its CPAN path in %RETRIEVE.

After processing the filehandle, we close it.
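The scan over RECENT might look something like the sketch below. Only the idea of mapping each filename to its CPAN path in %RETRIEVE comes from the walkthrough; the parse_recent() helper name and the exact pattern are assumptions made for illustration, and the sample lines stand in for data read from the FTP filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: pick module tarballs out of the lines of RECENT.
# Returns a hash that maps the bare filename to its CPAN path.
sub parse_recent {
    my @lines = @_;
    my %retrieve;
    for my $line (@lines) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;    # tolerate stray whitespace
        # Assumed pattern: gzipped tarballs under modules/by-module/
        next unless $line =~ m{^(modules/by-module/\S+/([^/\s]+\.tar\.gz))$};
        $retrieve{$2} = $1;
    }
    return %retrieve;
}

# Sample input standing in for the contents of RECENT.
my %RETRIEVE = parse_recent(
    "modules/by-module/Apache/Apache-Filter-1.011.tar.gz\n",
    "authors/00whois.html\n",
    "modules/by-module/Audio/Audio-Play-MPG123-0.04.tar.gz\n",
);
print "$_ => $RETRIEVE{$_}\n" for sort keys %RETRIEVE;
```

Keying the hash on the bare filename keeps the later download loop simple: the key names the local file, and the value is the remote path to fetch.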

Lines 24-32: Begin the mail message. We begin the outgoing mail message with a short text message that gives the number of enclosures. We create a new MIME::Entity object by calling the build() constructor with the introduction as its initial contents.

Notice that the arguments we pass to build() create a single-part document of type text/plain. Later, when we add the enclosures, we rely on MIME::Entity's ability to convert the message into a multipart message when needed.

Lines 33-44: Retrieve modules and attach them to the mail. We loop through the filenames stored in %RETRIEVE. For each one, we call the FTP object's get() method to download the file to the temporary directory. If successful, we use the Filename argument to attach the file to the outgoing mail message by calling the top-level entity's attach() method. Other attach() arguments set the encoding to base64, and the MIME type to application/x-gzip. CPAN files are gzipped by convention. We also add a short description to the attachment; currently it is just a copy of the filename.

Line 45: Add signature to the outgoing mail. If there is a file named .signature in the current user's home directory, we call the MIME entity's sign() method to attach it to the end of the message.

Lines 46-49: Send the mail. We call the entity's send() method to MIME-encode the message and send it via the SMTP protocol. When this is done, we call the entity's purge() method, deleting the downloaded files in the temporary directory. This works because the files became the basis for the MIME-entity bodies via the MIME::Body::File subclass when they were attached to the outgoing message, and purge() recursively deletes these files.

Note that the send() method relies on libnet being correctly configured to find a working SMTP server. If this is not the case, check and fix the Libnet.cfg file.

Line 51: Close FTP connection. Our last step is to close the FTP connection by calling the FTP object's quit() method.

Figure 7.10 shows a screenshot of Netscape Navigator displaying the resulting MIME message. Clicking on one of the enclosures will prompt you to save it to disk so that you can unpack and build the module.

Figure 7.10. A mail message sent from mail_recent.pl

A deficiency in the program is that the CPAN filenames can be cryptic, and it isn't always obvious what a package does. A nice enhancement to this script would be to unpack the package, scan through its contents looking for the POD documentation, and extract the description line following the NAME heading. This information could then be used as the MIME::Entity Description: field rather than the filename itself. A simpler alternative would be to enclose the .readme file that frequently (but not always) accompanies a package's .tar.gz file.

Chapter 8. POP, IMAP, and NNTP

Processing Mail and Netnews

In the last chapter we looked at client modules for sending Internet mail. In the first part of this chapter we'll look at modules for receiving mail and processing messages with enclosures (including multimedia enclosures). In the second part, we'll look at clients for the closely related Netnews protocol.

The Post Office Protocol

POP3 and IMAP are the two protocols used most to access Internet mail. Both were designed to allow a user to access mail drops on remote machines, and provide methods to list the contents of the user's mailbox, to download mail for viewing, and to delete messages the user is no longer interested in.

POP3 (Post Office Protocol version 3) is the older and simpler of the two. Described in RFC 1939 (STD 53), which superseded RFC 1725, it provides a straightforward interface for listing, retrieving, and deleting mail held on a remote server. IMAP (Internet Message Access Protocol), described in RFC 2060, adds sophisticated facilities for managing sets of remote and local mailboxes and synchronizing them when the user connects.

We will consider fetching mail from a POP3 server in this section. There are at least two Perl modules on CPAN for dealing with POP3 servers: Mail::POP3Client, written by Sean Dowd, and Net::POP3, by Graham Barr. Both provide essentially the same functionality but they use different APIs. The most important feature difference between the two is that Net::POP3 allows you to save the contents of a mail message to a filehandle, while Mail::POP3Client reads the entire mail message into memory. Because the ability to save to a filehandle makes a big difference when dealing with large e-mails (such as those containing MIME enclosures), I recommend Net::POP3.

Net::POP3 inherits from Net::Cmd, making it similar in style to Net::FTP and Net::SMTP. You begin by creating a new Net::POP3 object connected to the mailbox host. If this is successful, you log in using a username and password, and then invoke various methods to list the contents of the mailbox, retrieve individual messages, and possibly delete the retrieved messages.

Summarizing a POP3 Mailbox

Figure 8.1 shows a small program that will access a user's mailbox on a maildrop machine and print a brief summary of the senders and subject lines of all new messages. The username and mailhost are specified on the command line using the format username@mailbox.host. The program prompts for the password. Appendix A contains the listing for the PromptUtil.pm package.

Figure 8.1. List entries in a user's inbox

Lines 1-6: Load modules. We bring in the Net::POP3 module to contact the remote POP server, and Mail::Header to parse the retrieved mail headers. We also bring in a new home-brewed utility module, PromptUtil, which provides the get_passwd() function, along with a few other user prompting functions.

Lines 6-8: Get username, host, and password. We get the username and host from the command line, and prompt the user to enter his or her password using the get_passwd() function. The latter turns off terminal echo so that the password is not visible on the screen.

Line 9: Connect to mailbox host. We call the Net::POP3 new() method to connect to the indicated host, giving the server 30 seconds in which to respond with the welcome banner. The new() constructor returns a Net::POP3 object.

Lines 10-13: Log in and count messages. We call the POP3 object's login() method to log in with the user's name and password. If the login is successful, it returns the total number of messages in the user's mailbox; if there are no messages in the mailbox, it returns 0E0 ("zero but true"). This value evaluates as true in a logical context, so it can be used to test whether login was successful, and is equal to 0 when used in a numeric context to count the number of available messages.

Next we call the POP3 object's last() method to return the number of the last message the user read (0 if none read). We will use this to list the unread messages. Because the message count returned by login() can be 0E0, we add zero to it to convert it into a more familiar number. We then print the total number of old and new messages.
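The behavior of the "zero but true" value can be checked without a mail server. The sketch below simply assigns the string 0E0 to a variable, standing in for the return value of login() on an empty mailbox, and looks at it in both contexts.

```perl
use strict;
use warnings;

# "0E0" is the value returned for "success, but zero messages".
my $messages = "0E0";

# In a boolean test the non-empty string "0E0" counts as true ...
my $login_ok = $messages ? 1 : 0;

# ... but in numeric context it is zero (0 times 10 to the power 0).
my $count = $messages + 0;

print "login ok: $login_ok, messages: $count\n";
```

Because 0E0 is a well-formed number, adding zero to it raises no "isn't numeric" warning even with warnings enabled.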

Lines 14-21: Summarize messages. Each message is numbered from 1 to the total of messages in the mailbox. For each one, we call the POP object's top() method to retrieve the message header as a reference to an array of lines, and pass this to Mail::Header->new() for parsing. We call the parsed header's get() method twice to retrieve the Subject: and From: lines, and pass the sender's address to the clean_from() utility subroutine to clean it up a bit. We then print out the message number, sender's name, and subject.

Line 22: Log out. The POP object's quit() method logs out cleanly.

Lines 23-29: Clean up with the clean_from() subroutine. This subroutine cleans up sender addresses a bit, by extracting the sender's name from these three common address formats:

"Lincoln Stein" <lstein@cshl.org>
 Lincoln Stein <stein@cshl.org>
 lstein@cshl.org (Lincoln Stein)
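A subroutine in the spirit of clean_from() could be sketched as follows. The three patterns are assumptions chosen to match the formats listed above, not a copy of the code in Figure 8.1, and they make no attempt to handle every address form that the mail standards allow.

```perl
use strict;
use warnings;

# Sketch of a clean_from()-style helper; the patterns are assumptions.
sub clean_from {
    my $from = shift;
    return 'unknown' unless defined $from && length $from;
    chomp $from;
    return $1 if $from =~ /^"([^"]+)"\s*<[^>]*>/;   # "Name" <addr>
    return $1 if $from =~ /^([^<]+?)\s*<[^>]*>/;    # Name <addr>
    return $1 if $from =~ /\(([^)]+)\)/;            # addr (Name)
    return $from;                                   # bare address: keep as is
}

print clean_from($_), "\n" for (
    '"Lincoln Stein" <lstein@cshl.org>',
    'Lincoln Stein <lstein@cshl.org>',
    'lstein@cshl.org (Lincoln Stein)',
);
```

The patterns are tried from most to least specific, and an address that matches none of them is returned unchanged so that the summary line is never left blank.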
 

When we run this program, we get output like this:

% pop_stats.pl lstein@localhost
 inbox has 6 messages (6 new)
 1 Geoff Winisky             Re: total newbie question                       
 2 Robin Lofving             Server updates                                  
 3 James W Goldblum          Comments part 2                                 
 4 Jessica Raymond           Statistics on Transaction Security              
 5 James W Goldbum           feedback access from each page                  
 6 The Western Web           The Western Web Newsletter
 

Net::POP3 API

The Net::POP3 API is simple. You can log in, log out, list messages, retrieve message headers, retrieve the entire message, and delete messages.

$pop = Net::POP3->new([$host] [,$opt1=>$val1, $opt2=>$val2])

The new() method constructs a new Net::POP3 object. The first, optional, argument is the name or IP address of the mailbox host. This may be followed by a series of option/value pairs. If the host is not provided, it will be retrieved from the Net::Config "pop3_hosts" value specified when the libnet module was installed. The options are listed in Table 8.1.

The ResvPort option is used with some POP3 servers that require clients to connect from reserved ports.

If unsuccessful, new() returns undef and $! is set to some error code.

$messages = $pop->login([$username [,$password]])

The login() method attempts to log into the server using the provided username and password. If one or both of the password and username are not given, then login() looks in the user's .netrc file for the authentication information for the specified host.

If successful, login() returns the total number of messages in the user's mailbox. If there are no messages, login() returns the floating-point number 0E0, which is treated as true when used in a logical context to test whether login was successful, but evaluates to zero when treated in a numeric context to count the number of available messages. If an error occurs, login() returns undef and $pop->message() contains an error message.

If the login fails, you may try again or try to login using apop(). Some servers close the connection after a number of unsuccessful login attempts. With the exception of quit(), none of the other methods will be accepted until the server accepts the login.

Some POP servers support the APOP command.

$messages = $pop->apop($username,$password)

APOP is similar to a standard login, but instead of sending passwords across the network in the clear, it uses a challenge/response system to authenticate the user without transmitting cleartext passwords. Unlike login(), .netrc is not consulted if the username and password are absent. The value returned from apop() is the same as that from login().

Table 8.1. Net::POP3->new() Options
Option     Description                      Default
Port       Remote port to connect to        POP3 (110)
ResvPort   Local port to bind to            ephemeral port
Timeout    Seconds to wait for a response   120
Debug      Turn on verbose debugging        undef

Many POP3 servers need special configuration before the APOP command will authenticate correctly. In particular, most UNIX servers need a password file distinct from the system password file.
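To make the challenge/response idea concrete, the sketch below computes the digest that an APOP client sends in place of the password: the MD5 hash, in hexadecimal, of the timestamp from the server's greeting banner followed by the shared secret. The apop_digest() helper is a name invented for this illustration, and the sample values are the ones used in the APOP example of RFC 1939. In normal use the apop() method takes care of this step.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: compute the digest sent instead of the password.
# $banner_stamp is the <...> timestamp from the server's greeting line.
sub apop_digest {
    my ($banner_stamp, $secret) = @_;
    return md5_hex($banner_stamp . $secret);
}

# Timestamp, secret, and user name taken from the RFC 1939 APOP example.
my $digest = apop_digest('<1896.697170952@dbc.mtview.ca.us>', 'tanstaaf');
print "APOP mrose $digest\n";
```

Because the timestamp changes with every connection, a digest captured on the wire cannot simply be replayed in a later session.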

Once login is successful, you can use a variety of methods to access the mailbox:

$last_msgnum = $pop->last

POP messages are numbered from 1 through the total number of messages in the inbox. At any time, the user may have read one or more messages using the RETR command (see below), but not deleted them from the inbox. last() returns the highest number from the set of retrieved messages, or 0 if no messages have been retrieved. New messages begin at $last_msgnum+1.

Many POP servers store the last-read information between connections; however, a few discard this information.

$arrayref = $pop->get($msgnum [,FILEHANDLE])

Following a successful login, the get() method retrieves the message indicated by its message number, using the POP3 RETR command. It can be called with a filehandle, in which case the contents of the message (both header and body) are written to the filehandle. Otherwise, the get() method returns an array reference containing the lines of the message.

$handle = $pop->getfh($msgnum)

This is similar to get(), but the return value is a tied filehandle. Reading from this handle returns the contents of the message. When the handle returns end-of-file, it should be closed and discarded.

$flag = $pop->delete($msgnum)

delete() marks the indicated message for deletion. Marked messages are not removed until the quit() method is called, and can be unmarked by calling reset().

$arrayref = $pop->top($msgnum[,$lines])

The top() method returns the header of the indicated message as a reference to an array of lines. This format is suitable for passing to the Mail::Header->new() method. If the optional $lines argument is provided, then the indicated number of lines of the message body are included.

$hashref = $pop->list

$size = $pop->list($msgnum)

The list() method returns information on the size of mailbox messages. Called without arguments, it returns a hash reference in which the keys are message IDs, and the values are the sizes of the messages, in bytes. Called with a message ID, the method returns the size of the indicated message, or if an invalid message number was provided, it returns undef.

($msg_count,$size) = $pop->popstat

popstat() returns a two-element list that consists of the number of undeleted messages in the mailbox and the size of the mailbox in bytes.

$uidl = $pop->uidl([$msgnum])

The uidl() method returns a unique identifier for the given message number. Called without an argument, it returns a hash reference in which the keys are the message numbers for the entire mailbox, and the values are their unique identifiers. This method is intended to help clients track messages across sessions, since the message numbers change as the mailbox grows and shrinks.
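One way to use these identifiers is to remember which ones a client has already handled and compare them with the current listing. The sketch below works on plain hashes so that it runs without a server: %uidl imitates the hash reference that uidl() returns, %seen imitates state saved from an earlier session, and unseen_messages() is a helper name made up for this example.

```perl
use strict;
use warnings;

# Given a hash reference of message number => unique ID, and a set of
# IDs seen in earlier sessions, return the message numbers that have
# not been seen before, in ascending order.
sub unseen_messages {
    my ($uidl, $seen) = @_;
    return sort { $a <=> $b }
           grep { !$seen->{ $uidl->{$_} } } keys %$uidl;
}

# Made-up data standing in for a live server response.
my %uidl = (1 => 'uid-a', 2 => 'uid-b', 3 => 'uid-c');
my %seen = ('uid-a' => 1, 'uid-b' => 1);

my @new = unseen_messages(\%uidl, \%seen);
print "new messages: @new\n";
```

A real client would save the identifiers it has processed to disk between runs, since the message numbers themselves are only meaningful within a single session.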

When you call the quit() method, messages marked for deletion are removed unless you reset() first.

$pop->reset

This method resets the mailbox, unmarking the messages marked for deletion.

$pop->quit

The quit() method quits the remote server and disconnects. Any messages marked for deletion are removed from the mailbox.
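Putting several of these methods together, a minimal session might look like the sketch below. The summarize_mailbox() subroutine is a name chosen for this example, and the host, username, and password are supplied on the command line, so nothing touches the network unless all three are given.

```perl
use strict;
use warnings;
use Net::POP3;

# Print the size and subject line of every message in a mailbox.
# Host, user, and password are supplied by the caller.
sub summarize_mailbox {
    my ($host, $user, $pass) = @_;

    my $pop = Net::POP3->new($host, Timeout => 30)
        or die "Can't connect to $host: $!\n";

    my $count = $pop->login($user, $pass);
    unless (defined $count) {
        my $err = $pop->message;
        $pop->quit;
        die "Can't log in: $err\n";
    }
    $count += 0;    # turn 0E0 into a plain 0

    my $sizes = $pop->list || {};    # message number => size in bytes
    for my $msgnum (sort { $a <=> $b } keys %$sizes) {
        my $header = $pop->top($msgnum) or next;    # array ref of lines
        my ($subject) = grep { /^Subject:/i } @$header;
        $subject = "Subject: (none)\n" unless defined $subject;
        printf "%3d %8d bytes  %s", $msgnum, $sizes->{$msgnum}, $subject;
    }

    $pop->quit;     # nothing was marked, so nothing is deleted
    return $count;
}

# Only contact a server when run as: script.pl host user password
summarize_mailbox(@ARGV) if @ARGV == 3;
```

Passing the password on the command line is done here only to keep the sketch short; it is visible to other users in the process list, which is why the programs in this chapter prompt for it instead.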

Retrieving and Processing MIME Messages via POP

To show Net::POP3 in a real-world application, I developed a script called pop_fetch.pl that combines Net::POP3 and MIME::Parser. Figure 8.2 shows a session with this program. After I invoke it with the mailbox name in user@host form, the program prompts me for my login password. The program reports the number of messages in my mailbox, and then displays the date, sender, and subject line of the first, prompting me to read it or skip to the next.

Figure 8.2. A session with pop_fetch.pl

I choose to read the message, causing the program to display the message header and the text part of the body. It then reports that the message has two attachments (technically, two non-text/plain MIME parts). For each one, the program prompts me for the disposition of the attachment. For the first attachment, of type image/jpeg, I choose to view the attachment, causing my favorite image viewer (the XV application, written by John Bradley) to pop up in a new window and show the picture. After I quit the viewer, the script prompts me again for the disposition. This time I choose to save the image under its default name.

The next attachment is a Microsoft Word document. No viewer is defined for this document type, so the prompt only allows the attachment to be saved to disk.

After dealing with the last attachment, the program prompts me to keep or delete the entire message from the inbox, or to quit. I quit. The program then moves on to the next unprocessed message.

The pop_fetch.pl Script

pop_fetch.pl is broken into two parts. The main part, listed in Figure 8.3, handles the user interface. A smaller module named PopParser.pm subclasses Net::POP3 in such a way that messages retrieved from a POP3 mailbox are automatically parsed into MIME::Entities.

Figure 8.3. The pop_fetch.pl script

We'll look at pop_fetch.pl first.

Lines 1-6: Activate taint checking and load modules. Since we will be launching external applications (the viewers) based on information from untrusted sources, we need to be careful to check for tainted variables. The -T switch turns on taint checking. (See Chapter 10 for more information.)

We load PopParser and PromptUtil, two modules developed for this application.

Lines 7-11: Define viewers. We define constants for certain external viewers. For example, HTML files are displayed with the command lynx %s, where %s is replaced by the name of the HTML file to view. For variety, some of the viewers are implemented as pipes. For example, the player for MP3 audio files is invoked as mpg123 -, where the - symbol tells the player to take its input from standard input.

At the end of the code walkthrough, we'll discuss replacing this section of code with the standard mailcap facility.

Lines 12-13: Taint check precautions. As explained in more depth in Chapter 10, taint checking will not let us run with an untrusted path or with several other environment variables set. We set PATH to a known, trusted state, and delete four other environment variables that affect the way that commands are processed.

Lines 14-20: Recover username and mailbox host. We process the command-line arguments to recover the name of the user and the POP3 host.

The $entity global holds the most recent parsed MIME::Entity object. We make it global so that the script's END{} block can detect it and call its purge() method in case the user quits the program prematurely. This will delete all temporary files from disk. For similar reasons, we intercept the INT signal to exit gracefully if the user hits the interrupt key.

Lines 21-26: Log in to mailbox server. The PopParser.pm module defines a new subclass of Net::POP3 that inherits all the behavior of the base class, but returns parsed MIME::Entity objects from the get() method rather than the raw text of the message. We create a new PopParser object connected to the mailbox host. If this is successful, we call get_passwd() (imported from the PromptUtil module) to get the user's login password.

Next, we authenticate ourselves to the remote host. We don't know a priori whether the server accepts APOP authentication or the less secure cleartext authentication method, so we try them both. If the apop() method fails, then we try login(). If that also fails, we die with an error message.

If login is successful, we print the number of messages returned by the apop() or login() methods. We add 0 to the message count to convert the 0E0 result code into a more user-friendly integer.

Lines 27-38: Enter the main message-processing loop. For each message, we fetch its header by calling the PopParser object's top() method (which is inherited without modification from Net::POP3). The header text is then passed to our print_header() subroutine to display it as a one-line message summary.

We ask the user if he or she wants to read the message, and if so, we call the PopParser object's get() method, which fetches the indicated message, parses it, and returns a MIME::Entity object. This object is passed to our display_entity() subroutine in order to display it and its subparts. When display_entity() is finished, we delete the entity's temporary files by calling its purge() method.

The last step is to ask the user if he or she wants to delete the message from the remote mailbox, and if the answer is affirmative, we call the PopParser's delete() method.

Lines 39-45: print_header() subroutine. The print_header() subroutine takes an array ref containing the header lines returned by $POP->top() and turns it into a one-line summary for display. Although we could have used the Mail::Header module for this purpose, it turned out to be cleaner to parse the header into a hash ourselves using the idiom of the Net::SMTP mail client of Figure 7.2.

The output line contains the date, sender, and subject line, separated by tabs.

Lines 46-60: display_entity() subroutine. This subroutine is responsible for displaying a MIME::Entity object. It is called recursively to process both the top-level object and each of its subparts (and sub-subparts, if any).

We begin by retrieving the message's mail header as a MIME::Head object. If the header contains a From: field, then we can conclude that it is the top-level entity. We print out the header so that the user can see the sender's name and other fields.

Next we check whether the entity is multipart by calling its is_multipart() method. If this method returns true, then we call handle_multipart() to prompt the user for each of the parts. Otherwise, we invoke a subroutine called display_part() to display the contents of the entity.

Lines 61-78: The handle_multipart() subroutine. The handle_multipart() subroutine loops through and processes each part of a multipart MIME::Entity object. We begin by calling the entity's parts() method to fetch each of the subparts as a MIME::Entity object. We then call Perl's grep() built-in twice to sort the parts into those that we can display directly and those that are to be treated as attachments that must be displayed using an external application. Since we know how to display only plain text, we sort on the MIME type text/plain.

For each of the text/plain parts, we call the display_part() subroutine to print the message body to the screen. If there are nontext attachments, we prompt the user for permission to display them, and if it is granted, invoke display_entity() recursively on each attachment. This recursive invocation of display_entity() allows for attachments that are themselves multipart messages, such as forwarded e-mails.

Lines 79-99: The display_part() subroutine. The display_part() subroutine is invoked to display a single-part MIME::Entity. Depending on the user's wishes, its job is to display, save, or ignore the part.

We begin by retrieving the part's header, MIME type, description, and suggested filename for saving (derived from the Content-Disposition: header, if present). We also recover the part's MIME::Body object by calling its bodyhandle() method. This object gives us access to the body's unencoded content.

If the part's MIME type is text/plain, we do not need an external viewer to display it. We simply call the body object's print() method to print the contents to standard output. Otherwise, we call get_viewer() to return the name of an external viewer that can display this MIME type. We print a summary that contains the part's MIME type, description, and suggested filename, and then prompt the user to view or save the part. Depending on the user's response, we invoke save_body() to save the part's content to disk, or display_body() to launch the external viewer to display it. This continues in a loop until the user chooses "n" to go to the next part.

If no viewer is defined for the part's MIME type, the user's only option is to save the content to disk.

Lines 100-114: The save_body() subroutine. The save_body() subroutine accepts a MIME::Body object and a default filename. It gives the user the opportunity to change the filename, opens the file, and writes the contents of the part to disk.

The most interesting feature of this subroutine is the way that we treat the default filename for the attachment. This filename is derived from the Content-Disposition: header, and as such is untrusted data. Someone who wanted to spoil our day could choose a malicious pathname, such as one that would overwrite a treasured configuration file. For this reason we forbid absolute pathnames and those that contain the ".." relative path component. We also forbid filenames that contain unusual characters such as shell metacharacters. Having satisfied these tests, we extract the filename using a pattern match, thereby untainting it. Perl will now allow us to open the file for writing. We do so and write the attachment's contents to it by calling the MIME::Body object's print() method.
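The filename checks described here could be sketched along the following lines. The safe_filename() name and the set of permitted characters are assumptions made for this example rather than the code of Figure 8.3, and a stricter character set may be appropriate on some systems.

```perl
use strict;
use warnings;

# Return an untainted copy of $name if it looks harmless, or undef if not.
# The allowed character set is an assumption; tighten it as needed.
sub safe_filename {
    my $name = shift;
    return unless defined $name && length $name;
    return if $name =~ m{^/};                # absolute pathname
    return if $name =~ m{(^|/)\.\.(/|$)};    # ".." path component
    # Capturing with a pattern match is what untaints the value under -T.
    return $name =~ m{^([\w.\-/ ]+)$} ? $1 : undef;
}

for my $candidate ('report.doc', '/etc/passwd', '../../.profile', 'x; rm -rf ~') {
    my $safe = safe_filename($candidate);
    printf "%-16s %s\n", $candidate, defined $safe ? "ok" : "rejected";
}
```

Listing the characters that are allowed, rather than the ones that are forbidden, means that an unanticipated shell metacharacter is rejected by default.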

Lines 116-128: The display_body() subroutine. The display_body() subroutine is called to launch an external viewer to display an attachment. It is passed a MIME::Body object, and a command to launch an external viewer to display it.

To make this application a bit more interesting, we allow for two types of viewers: those that read the body data from a file on disk and those that read from standard input. The former are distinguished from the latter by containing the symbol %s, which will be replaced by the filename before execution (this is a standard convention in the UNIX mailcap file).

We begin by calling the MIME::Body object's path() method to obtain the path to the temporary file in which the object's data is stored. We then use this in a pattern substitution to replace any occurrence of %s in the viewer command. If the substitution is successful, it returns a true value, and we call system() to invoke the command.

Otherwise, we assume that the viewer will read the data from standard input. In this case, we use open() to open a pipe to the viewer command, and invoke the body object's print() method to print to the pipe filehandle. Before doing this, however, we set the PIPE handler to IGNORE to avoid the program terminating unexpectedly because of a recalcitrant viewer.

This subroutine works correctly both for line-oriented applications, such as the Lynx HTML viewer, and for windowing applications, such as XV.
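The choice between the two kinds of viewer can be isolated in a small helper, sketched below. The prepare_viewer() name and the sample paths are invented for this illustration; the sketch only builds the command string and reports which mode applies, leaving the actual system() or piped open() call to the caller.

```perl
use strict;
use warnings;

# Decide how a viewer command wants its data.
# Returns (command, mode), where mode is 'file' or 'stdin'.
sub prepare_viewer {
    my ($viewer, $path) = @_;
    # s/// returns the number of substitutions, so it doubles as the test.
    if ($viewer =~ s/%s/$path/g) {
        return ($viewer, 'file');     # caller runs it with system()
    }
    return ($viewer, 'stdin');        # caller opens a pipe and prints to it
}

my ($cmd1, $mode1) = prepare_viewer('lynx %s', '/tmp/msg-1.html');
my ($cmd2, $mode2) = prepare_viewer('mpg123 -', '/tmp/msg-2.mp3');
print "$mode1: $cmd1\n$mode2: $cmd2\n";
```

Note that the path is inserted into the command without any shell quoting, so this approach is only reasonable when the path is one the program generated itself, such as a temporary file name.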

Lines 129-137: The get_viewer() subroutine. get_viewer() is an extremely simple subroutine that uses a pattern match to examine the MIME type of the attachment and selects a hard-coded viewer for it.

Lines 138-140: END{} block. This script's END{} block takes care of calling any leftover MIME::Entity's purge() method. This deletes temporary files that might be left around if the user interrupted the script's execution unexpectedly.

The PopParser Module

The other main component of the pop_fetch.pl script is the PopParser module, which subclasses Net::POP3 in a way that enables it to parse MIME messages at the same time that it is fetching them. Figure 8.4 shows the code for PopParser.pm.

Figure 8.4. The PopParser module

Lines 1-6: Load modules. We turn on strict checking and load the Net::POP3 and MIME::Parser modules. We use the global @ISA array to tell Perl that PopParser is a subclass of Net::POP3.

Lines 7-15: Override the new() method. We override the Net::POP3 new() method in order to create and initialize a MIME::Parser for later use. We first invoke our parent's new() method to create the basic object and connect to the remote host. We then create and configure a MIME::Parser object, and store the parser for later use by invoking our parser() accessor method.

Lines 16-21: The parser() method. This method is an accessor for the MIME::Parser object created during the call to new(). If we are called with a parser object on our subroutine stack, we store it among our instance variables. Otherwise, we return the current parser object to the caller.

The way we stash the parser object among our instance variables looks weird, but it is the conventional way to store instance variables in filehandle objects:

${*$self}{'pp_parser'} = shift
 

What this is doing is referencing a hash in the symbol table that happens to have the same name as our filehandle. We then index into that as if it were a conventionally created hash. We need to store our instance variables this way because Net::POP3 ultimately descends from IO::Handle, which creates and manipulates blessed filehandles, rather than more conventional blessed hash references.
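The idiom is easier to follow in isolation. The sketch below builds a tiny class whose objects are blessed glob references and stores a value in the glob's hash slot. The GlobObject class is invented for this illustration and uses Symbol::gensym() to obtain an anonymous glob; it stands in for, and is much simpler than, a real IO::Handle subclass.

```perl
use strict;
use warnings;
use Symbol ();

package GlobObject;

# Construct an object that is a blessed glob reference, as IO::Handle
# subclasses are, rather than a blessed hash reference.
sub new {
    my $class = shift;
    return bless Symbol::gensym(), $class;
}

# Combined getter/setter that keeps its value in the glob's hash slot.
sub parser {
    my $self = shift;
    ${*$self}{'pp_parser'} = shift if @_;
    return ${*$self}{'pp_parser'};
}

package main;

my $obj = GlobObject->new;
$obj->parser('a stand-in for a MIME::Parser object');
print $obj->parser, "\n";
```

Prefixing the key with a short tag such as pp_ reduces the chance of colliding with keys that a parent class keeps in the same hash.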

Lines 22-30: Override the get() method. The last part of this module overrides the Net::POP3 get() method. We are called with the number of the message to retrieve, which we pass to getfh() to obtain a tied filehandle from which to read the desired message. The returned filehandle is immediately passed to our stored MIME::Parser object to parse the message and return a MIME::Entity object.

The nice thing about the design of the PopParser module is that message retrieval and message parsing occur in tandem, rather than downloading the entire message and parsing it in two steps. This saves considerable time for long messages.

There are a number of useful enhancements one could make to pop_fetch.pl. The one with the greatest impact would be to expand the range and flexibility of the viewers for nontext attachments. The best way to do this would be to provide support for the system /etc/mailcap and per-user .mailcap files, which on UNIX systems map MIME types to external viewers. This would allow the user to install and customize viewers without editing the code. Support for the mailcap system can be found in the Mail::Cap module, which is part of Graham Barr's MailTools package. To use Mail::Cap in the pop_fetch.pl script, replace lines 7 through 11 of Figure 8.3 with these lines:

use Mail::Cap;
 my $mc = Mail::Cap->new;
 

This brings in the Mail::Cap module and creates a new Mail::Cap object that we can use to fetch information from the mailcap configuration files.

Replace line 90, which invokes the get_viewer() subroutine, with the equivalent call from Mail::Cap:

my $viewer = $mc->viewCmd($type);
 

This takes a MIME type and returns the command to invoke to view it if one is defined.

The last modification is to replace line 97, which calls the display_body() subroutine to run the viewer on the body of an attachment, with the Mail::Cap equivalent:

$mc->view($type,$body->path);
 

This call looks up the appropriate view command for the specified MIME type, does any needed string substitutions, and invokes the command using system().

We no longer need the get_viewer() and display_body() subroutines, because Mail::Cap takes care of their functionality. You can delete them.

Other potential enhancements to this script include:

  • the ability to reply to messages

  • the ability to list old and new messages and jump directly to messages of interest

  • a full windowing display using the text-mode Curses module or the graphical PerlTK package, both available from CPAN

With a little work, you could turn this script into a full-featured e-mail client!

The IMAP Protocol

The POP3 protocol was designed to handle the case of a user who spends most of his or her time working on a single machine. The mail client's job is to fetch the user's unread mail from time to time from the remote mailbox server. The user then reads the mail and possibly sorts it into several local mail folders.

Keeping track of mail becomes more complicated, however, when the user is moving around a lot: working on a desktop in the office, a laptop while traveling, and another desktop at home. In this case, the user wants to see the same set of mail files no matter where he or she happens to be working. The Internet Message Access Protocol (IMAP) satisfies these needs by managing multiple remote mail folders and transparently synchronizing them with local copies, providing the user with a consistent view of stored e-mail. IMAP clients also provide the user with the ability to work off-line, and with sophisticated server-side message search functions.

Unfortunately, the IMAP protocol is also rather complex and it does certain things that the simple request/response model of Net::POP3 can't easily handle. Among other things, IMAP servers send unsolicited messages to the client from time to time, for example to alert the client that new mail has arrived. No fewer than three Perl modules on CPAN deal with IMAP: Mail::IMAPClient, Net::IMAP, and Net::IMAP::Simple.

Mail::IMAPClient, written by David Kernen, provides the most functionality of the three, providing methods for issuing all of the IMAP commands. However, Mail::IMAPClient does not do such a good job at mapping the IMAP server's responses onto easily handled Perl objects. To use this module, you'll need RFC 2060 on hand and be prepared to parse the server responses yourself.

Net::IMAP, written by Kevin Johnson, does a better job at handling the server's responses, and provides a nifty callback interface that allows you to intercept and handle server events. Unfortunately, the module is in alpha stage and the interfaces are changing. Also, at the time this book was written, the module's documentation was incomplete.

Currently, the most usable interface to IMAP is Joao Fonseca's Net::IMAP::Simple, which provides access to the subset of IMAP that is most like POP3. In fact, Net::IMAP::Simple shares much of Net::POP3's method interface and is, to a large extent, plug compatible.

As with Net::POP3, you work with Net::IMAP::Simple by calling its new() method to connect to an IMAP server host, authenticating with login(), listing messages with list() and top(), and retrieving messages with get(). Unlike Net::POP3, Net::IMAP::Simple has no apop() method for authenticating without plaintext passwords. To make up for this deficiency, it has the ability to work with multiple remote mailboxes. Net::IMAP::Simple can list the user's mailboxes, create and delete them, and copy messages from one folder to another.

Summarizing an IMAP Mailbox

The pop_stats.pl program from Figure 8.1 summarizes the contents of a POP3 mailbox. We'll now enhance this program to summarize an IMAP mailbox. As an added feature, the new script, named imap_stats.pl, indicates whether a message has been read. You call it like pop_stats.pl, but with an additional optional command-line argument that indicates the name of the mailbox to summarize:

% imap_stats.pl lstein@localhost gd_bug_reports
 lstein@localhost password:
 gd has 6 messages (2 new)
   1 Honza Pazdziora       Re: ANNOUNCE: GD::Latin2 patch (fwd)    read
   2 Gurusamy Sarathy      Re: patches for GD by Gurusamy Sarathy  read
   3 Honza Pazdziora       Re: ANNOUNCE: GD::Latin2 patch (fwd)    read
   4 Erik Bertelsen        GD-1.18, 2 minor typos                  read
   5 Erik Bertelsen        GD fails om some GIF's                unread
   6 Honza Pazdziora       GDlib version 1.3                     unread
 

Figure 8.5 lists imap_stats.pl.

Figure 8.5. Summarize an IMAP mailbox

Lines 1-5: Load modules. We load Net::IMAP::Simple, Mail::Header, and the PromptUtil module used in earlier examples.

Lines 6-9: Process command-line arguments. We parse out the username and mailbox host from the first command-line argument, and recover the mailbox name from the second. If no mailbox name is provided, we default to INBOX, which is the default mailbox name on many UNIX systems. We then prompt for the user's password.

Lines 10-14: Connect to remote host. We call the Net::IMAP::Simple->new() method to connect to the designated host, and then call login() to authenticate. If these steps are successful, we call the object's select() method to select the indicated mailbox. This call returns the total number of messages in the mailbox, or if the mailbox is empty or missing, undef. We fetch the number of the last message read by calling last().

Lines 15-24: List contents of the mailbox. We loop through each of the messages from first to last. For each one, we fetch the header by calling top(), parse it into a Mail::Header object, and retrieve the Subject: and From: fields. We also call the IMAP object's seen() method to determine whether the message has been retrieved. We then print the message number, sender, subject line, and read status.

Lines 26-32: clean_from() subroutine. This is the same subroutine we saw in the earlier version of this program. It cleans up the sender addresses.
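The listing loop described above can be sketched roughly as follows. This is not the imap_stats.pl listing itself: the subroutine names summarize_mailbox() and header_field() are invented here, a small regular expression stands in for Mail::Header so that the sketch has no non-core dependencies, and the clean_from() step and password prompt are omitted. The code assumes only that $imap offers the select(), top(), and seen() methods described in this section.

```perl
use strict;
use warnings;

# Pull one field out of the array of header lines returned by top().
# Simplified stand-in for Mail::Header: continuation lines are ignored.
sub header_field {
    my ($lines, $name) = @_;
    for my $line (@$lines) {
        return $1 if $line =~ /^\Q$name\E:\s*(.*?)\s*$/i;
    }
    return '';
}

# Summarize $mailbox using an already logged-in Net::IMAP::Simple object
# (or any object with compatible select(), top() and seen() methods).
# Returns one formatted line per message, or an empty list if the
# mailbox cannot be selected.
sub summarize_mailbox {
    my ($imap, $mailbox) = @_;
    my $count = $imap->select($mailbox);
    return unless defined $count;
    my @rows;
    for my $msgnum (1 .. $count) {
        my $header = $imap->top($msgnum) or next;
        push @rows, sprintf "%4d %-25.25s %-40.40s %s",
            $msgnum,
            header_field($header, 'From'),
            header_field($header, 'Subject'),
            $imap->seen($msgnum) ? 'read' : 'unread';
    }
    return @rows;
}
```

With a live server, one would construct the object with Net::IMAP::Simple->new($host), confirm that login() returned a defined value, and then print the lines returned by summarize_mailbox($imap, 'INBOX').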

The Net::IMAP::Simple API

Although Net::IMAP::Simple is very similar to Net::POP3, there are some important differences. The most dramatic difference is that Net::IMAP::Simple does not inherit from Net::Cmd and, therefore, does not implement the message() or code() methods. Furthermore, Net::IMAP::Simple is not a subclass of IO::Socket and, therefore, cannot be treated like a filehandle.

The new() and login() methods are similar to Net::POP3:

$imap = Net::IMAP::Simple->new($host [,$opt1=>$val1, $opt2=>$val2])

The new() method constructs a new Net::IMAP::Simple object. The first argument is the name of the host, and is not optional (unlike the Net::POP3 equivalent). This is followed by a series of options that are passed directly to IO::Socket::INET.

If unsuccessful, new() returns undef and $! is set to some error code. Otherwise, it returns a Net::IMAP::Simple object connected to the server.

$messages = $imap->login($username,$password)

The login() method attempts to log into the server using the provided username and password. The username and password are required, also a departure from Net::POP3. If successful, the method returns the number of messages in the user's default mailbox, normally INBOX. Otherwise, login() returns undef.

Note that login() does not return 0E0 for a default mailbox that happens to be empty. The correct test for a successful login is to test for a defined return value.
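A small sketch of that test follows. The helper name login_or_die() is invented for illustration; it assumes only that $imap has a login() method that behaves as described above.

```perl
use strict;
use warnings;

# Log in and return the message count of the default mailbox.
# $imap is expected to behave like a Net::IMAP::Simple object.
sub login_or_die {
    my ($imap, $user, $password) = @_;
    my $count = $imap->login($user, $password);
    # Test definedness, not truth: 0 means "logged in, mailbox empty".
    die "login failed for $user\n" unless defined $count;
    return $count;
}
```

Writing the check as "or die" directly on the login() call would wrongly treat an empty INBOX as a failed login.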

Several functions provide access to mailboxes.

@mailboxes = $imap->mailboxes

The mailboxes() method returns a list of all the user's mailboxes.

$messages = $imap->select($mailbox)

The select() method selects a mailbox by name, making it current. If the mailbox exists, select() returns the number of messages it contains (0 for a mailbox that happens to be empty). If the mailbox does not exist, the method returns undef and the current mailbox is not changed.

$success = $imap->create_mailbox($mailbox)

$success = $imap->delete_mailbox($mailbox)

$success = $imap->rename_mailbox($old_name,$new_name)

The create_mailbox(), delete_mailbox(), and rename_mailbox() methods attempt to create, delete, and rename the named mailbox, respectively. They return true if successful, and false otherwise.

Once you have selected a mailbox, you can examine and retrieve its contents.

$last_msgnum = $imap->last

The last() method returns the highest number of the read messages in the current mailbox, just as Net::POP3 does. You can also get this information by calling the seen() method, as described below.

$arrayref = $imap->get($msgnum)

The get() method retrieves the message indicated by the provided message number from the current mailbox. The return value is a reference to an array containing the message lines.

$handle = $imap->getfh($msgnum)

This is similar to get() but the return value is a filehandle that can be read from in order to retrieve the indicated message. This method differs from the similarly named Net::POP3 method by returning a filehandle opened on a temporary file, rather than a tied filehandle. This means that the entire message is transferred from the remote server to the local machine behind the scenes before you can begin to work with it.

$flag = $imap->delete($msgnum)

The delete() method marks the indicated message for deletion from the current mailbox. Marked messages are not removed until the quit() method is called. However, there is no reset() call to undo a deletion.

$arrayref = $imap->top($msgnum)

The top() method returns the header of the indicated message as a reference to an array of lines. This format is suitable for passing to the Mail::Header->new() method. There is no option for fetching a certain number of lines from the body text.

$hash = $imap->list

$size = $imap->list($msgnum)

The list() method returns information on the size of mailbox messages. Called without arguments, it returns a hash reference in which the keys are message IDs, and the values are the sizes of the messages, in bytes. Called with a message ID, the method returns the size of the indicated message, or if an invalid message number was provided, it returns undef.
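As an illustration of working with the hash reference form, the sketch below totals the sizes and finds the largest message. The helper name mailbox_usage() is hypothetical, and the code assumes list() behaves as just described.

```perl
use strict;
use warnings;

# Return (total size in bytes, number of the largest message) for the
# current mailbox. $imap is expected to behave like Net::IMAP::Simple.
sub mailbox_usage {
    my $imap  = shift;
    my $sizes = $imap->list or return (0, undef);
    my ($total, $largest) = (0, undef);
    for my $msgnum (sort { $a <=> $b } keys %$sizes) {
        $total += $sizes->{$msgnum};
        $largest = $msgnum
            if !defined $largest || $sizes->{$msgnum} > $sizes->{$largest};
    }
    return ($total, $largest);
}
```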

$flag = $imap->seen($msgnum)

The seen() method returns true if the indicated message has been read (by calling the get() method), or false if it has not.

$success = $imap->copy($msgnum,$mailbox_destination)

The copy() method attempts to copy the indicated message from the current mailbox to the indicated destination mailbox. If successful, the method returns a true value and the indicated message is appended to the end of its destination. You may wish to call delete() to remove the message from its original mailbox.
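Combining the two calls gives a simple "move" operation. This is a sketch under the assumption that copy() and delete() return true on success as described; the name move_message() is invented here.

```perl
use strict;
use warnings;

# Move a message by copying it and then marking the original for deletion.
# $imap is expected to behave like a Net::IMAP::Simple object.
sub move_message {
    my ($imap, $msgnum, $destination) = @_;
    # Only mark the original once the copy is known to have succeeded.
    $imap->copy($msgnum, $destination) or return 0;
    return $imap->delete($msgnum) ? 1 : 0;
}
```

Since there is no reset() call, checking the result of copy() before calling delete() is the only safeguard against losing a message.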

When you are finished, the quit() method will clean up:

$imap->quit()

quit() takes no arguments. It deletes all marked messages and logs off.

Internet News Clients

The Netnews system dates back to 1979, when researchers at Duke University and the University of North Carolina designed a system to distribute discussion group postings that would overcome the limitations of simple mailing lists [Spencer & Lawrence, 1998]. This rapidly grew into Usenet, a global Internet-based bulletin-board system comprising thousands of named newsgroups.

Because of its sheer size (more than 34,000 newsgroups and daily news flow rates measured in the gigabytes), Usenet has been diminishing in favor among Internet users. However, there has been a resurgence of interest recently in using Netnews for private discussion servers, helpdesk applications, and other roles in corporate intranets.

Netnews is organized in a two-level hierarchy. At the upper level are the newsgroups. These have long meaningful names like comp.graphics.rendering.raytracing. Each newsgroup, in turn, contains zero or more articles. Users post articles to their local Netnews server, and the Netnews distribution software takes care of distributing the article to other servers. Within a day or so, a copy of the article appears on every Netnews server in the world. Articles live on Netnews for some period before they are expired. Depending on each server's storage capacity, a message may be held for a few days or a few weeks before it expires. A few large Netnews servers, such as the one at http://www.deja.com/, hold news articles indefinitely.

Newsgroups are organized using a hierarchical namespace. For example, all newsgroups beginning with comp. are supposed to have something to do with computers or computer science, and all those beginning with soc.religion. are supposed to concern religion in society. The creation and destruction of newsgroups, by and large, is controlled by a number of senior administrators. The exception is the alt hierarchy, in which newsgroups can be created willy-nilly by anyone who desires to do so. Some very interesting material resides in these groups.

Regardless of its position in the namespace hierarchy, a newsgroup can be moderated or unmoderated. Moderated groups are "closed." Only a small number of people (typically a single moderator) have the right to post to the newsgroup. When others attempt to post to the newsgroup, their posting is automatically forwarded to the moderator via e-mail. The moderator then posts the message at his or her discretion. Anyone can post to unmoderated groups. The posted article is visible immediately on the local server, and diffuses quickly throughout the system.

Articles are structured like e-mails, and in fact share the same RFC 822 specification. Figure 8.6 shows a news article recently posted to comp.lang.perl.modules. The article consists of a message header and body. The header contains several fields that you will recognize from the standard e-mail, such as the Subject: and From: lines, and some fields that are specific to news articles, such as Article:, Path:, Message-ID:, Distribution:, and References:. Many of these fields are added automatically by the Netnews server.

Figure 8.6. A typical Netnews article

To construct a valid Netnews article, you need only take a standard e-mail message and add a Newsgroups: header containing a comma-delimited list of newsgroups to post to. Another frequently used article header is Distribution:, which limits the distribution of an article. Valid values for Distribution: depend on the setup of your local Netnews server, but they are typically organized geographically. For example, the usa distribution limits message propagation to the political boundaries of the United States, and nj limits distribution to New Jersey. The most common distribution is world, which allows the article to propagate globally.

Other article header fields have special meaning to the Netnews system, and can be used to create control messages that cancel articles, add or delete newsgroups, and perform other special functions. See [Spencer and Lawrence 1998] for information on constructing your own control messages.

Netnews interoperates well with MIME. An article can have any number of MIME-specific headers, parts, and subparts, and MIME-savvy news readers are able to decode and display the parts.

Articles can be identified in either of two ways. Within a newsgroup, an article can be identified by its message number within the group. For example, the article shown in Figure 8.6 is message number 36,166 of the newsgroup comp.lang.perl.modules. Because articles are constantly expiring and being replaced by new ones, the number of the first message in a group is usually not 1, but more often a high number. The message number for an article is stable on any given news server. On two subsequent days, you can retrieve the same article by entering a particular newsgroup and retrieving the same message number. However, message numbers are not stable across servers. An article's number on one news server may be quite different on another server.

The other way to identify articles is by the message ID. The message ID of the sample article is <397a6e8d.524144494f47414741@radiogaga.harz.de>, including the angle brackets at either side. Message IDs are unique, global identifiers that remain the same from server to server.

Net::NNTP

Historically, Netnews has been distributed in a number of ways, but the dominant mode is now the Net News Transfer Protocol, or NNTP, described in RFC 977. NNTP is used both by Netnews servers to share articles among themselves and by client applications to scan and retrieve articles of interest. Graham Barr's Net::NNTP module, part of the libnet utilities, provides access to NNTP servers.

Like other members of the libnet clan, Net::NNTP descends from Net::Cmd and inherits that module's methods. Its API is similar to Net::POP3 and Net::IMAP::Simple. You connect to a remote Netnews server, creating a new Net::NNTP object, and use this object to communicate with the server. You can list and filter newsgroups, make a particular newsgroup current, list articles, download them, and post new articles.

newsgroup_stats.pl is a short script that uses Net::NNTP to find all newsgroups that match a pattern and count the number of articles in each. For example, to find all the newsgroups that have something to do with Perl, we could search for the pattern "*.perl*" (the output has been edited slightly for space):

% newsgroup_stats.pl '*.perl*'
 alt.comp.perlcgi.freelance                               454 articles
 alt.flame.marshal.perlman                                  3 articles
 alt.music.perl-jam                                        11 articles
 alt.perl.sockets                                          45 articles
 comp.lang.perl.announce                                   43 articles
 comp.lang.perl.misc                                    18940 articles
 comp.lang.perl.moderated                                 622 articles
 comp.lang.perl.modules                                  2240 articles
 comp.lang.perl.tk                                        779 articles
 cz.comp.lang.perl                                         63 articles
 de.comp.lang.perl.cgi                                   1989 articles
 han.comp.lang.perl                                       174 articles
 it.comp.lang.perl                                        715 articles
 japan.comp.lang.perl                                      53 articles
 

Notice that the pattern match wasn't perfect, and we matched alt.music.perl-jam as well as newsgroups that have to do with the language. Figure 8.7 lists the code.

Figure 8.7. match_newsgroup.pl script

Lines 1-3: Load modules. We turn on strict checking and load the Net::NNTP module.

Line 4: Create new Net::NNTP object. We call Net::NNTP->new() to connect to a Netnews host. If the host isn't specified explicitly, then Net::NNTP chooses a suitable host from environment variables or the default NNTP server specified when libnet was installed.

Lines 5-6: Print stats and quit. For each argument on the command line, we call the print_stats() subroutine to look up the pattern and print out matching newsgroups. We then call the NNTP object's quit() method.

Lines 7-17: print_stats() subroutine. In the print_stats() subroutine we invoke the NNTP object's newsgroups() method to find newsgroups that match a pattern. If successful, newsgroups() returns a hash reference in which the keys are newsgroup names and the values are brief descriptions of the newsgroup.

If the value returned by newsgroups() is undef or empty, we return. Otherwise, we sort the groups alphabetically by name, and loop through them. For each group, we call the NNTP object's group() method to return a list containing information about the number of articles in the group and the message numbers of the first and last articles. We print the newsgroup name and the number of articles it contains.
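The logic of print_stats() as described can be sketched like this. It is an approximation, not the Figure 8.7 listing: the name group_stats() is invented, and the subroutine returns formatted lines rather than printing them so that it can be exercised without a server. It assumes $nntp provides newsgroups() and group() with the return values given in this section.

```perl
use strict;
use warnings;

# Return one formatted line per newsgroup matching $pattern.
# $nntp is expected to behave like a connected Net::NNTP object.
sub group_stats {
    my ($nntp, $pattern) = @_;
    my $groups = $nntp->newsgroups($pattern);
    return unless $groups && %$groups;
    my @lines;
    for my $name (sort keys %$groups) {
        # In list context group() returns (count, first, last, name).
        my ($articles) = $nntp->group($name);
        next unless defined $articles;
        push @lines, sprintf "%-50s %6d articles", $name, $articles;
    }
    return @lines;
}
```

A driver would create the connection with Net::NNTP->new(), print the lines returned by group_stats($nntp, $_) for each command-line pattern, and finish with $nntp->quit.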

The Net::NNTP API

The Net::NNTP API can be divided roughly into those methods that deal with the server as a whole, those that affect entire newsgroups, and those that concern individual articles in a newsgroup.

Newsgroups can be referred to by name or, for some methods, by a wildcard pattern match. The pattern-matching system used by most NNTP servers is similar to that used by the UNIX and DOS shells. "*" matches zero or more of any characters, "?" matches exactly one character, and a set of characters enclosed in square brackets, as in "[abc]", matches any member of the set. Bracketed character sets can also contain character ranges, as in "[0-9]" to match the digits 0 through 9, and the "^" character may be used to invert a set; for example, "[^A-Z]" matches any character that is not in the range A through Z. Any other character matches itself exactly once. As in the shell (and unlike Perl's regular expression operations), NNTP patterns are automatically anchored to the beginning and end of the target string.
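To make the rules concrete, here is a sketch that translates such a pattern into an anchored Perl regular expression, which can be handy for filtering group names on the client side. The function name wildmat_to_regex() is invented for this example, it implements only the rules listed above, and it assumes bracketed sets contain nothing but plain characters and ranges. Real servers may support additional syntax.

```perl
use strict;
use warnings;

# Translate an NNTP-style wildcard pattern into an anchored Perl regex.
sub wildmat_to_regex {
    my $pattern = shift;
    my $re = '';
    while (length $pattern) {
        if    ($pattern =~ s/^\*//) { $re .= '.*' }    # any run of characters
        elsif ($pattern =~ s/^\?//) { $re .= '.'  }    # exactly one character
        elsif ($pattern =~ s/^\[(\^?)([^\]]+)\]//) {   # set, optionally inverted
            $re .= '[' . $1 . $2 . ']';
        }
        else {                                         # literal character
            $re .= quotemeta(substr($pattern, 0, 1, ''));
        }
    }
    return qr/\A$re\z/s;    # anchored at both ends, like the shell
}

my $perl_groups = wildmat_to_regex('*.perl*');
print "matched\n" if 'comp.lang.perl.misc' =~ $perl_groups;
```

Note that "." in a group pattern is a literal dot, which is why it is passed through quotemeta() rather than copied into the regular expression unchanged.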

Articles can be referred to by their number in the current newsgroup, by their unique message IDs, or, for some methods, by a range of numbers. In the latter case, the range is specified by providing a reference to a two-element array containing the first and last message numbers of the range. Some methods allow you to search for particular articles by looking for wildcard patterns in the header or body of the message using the same syntax as newsgroup name wildcards.

Other methods accept times and dates, as for example, the newgroups() method that searches for newsgroups created after a particular date. In all cases, the time is expressed in its native Perl form as seconds since the epoch, the same as that returned by the time() built-in.

In addition to the basic NNTP functions, many servers implement a number of extension commands. These extensions make it easier to search a server for articles that match certain criteria and to summarize quickly the contents of a discussion group. Naturally, not all servers support all extensions, and in such cases the corresponding method usually returns undef. In the discussion that follows, methods that depend on NNTP extensions are marked.

We look first at methods that affect the server itself.

$nntp = Net::NNTP->new([$host],[$option1=>$val1,$option2=>$val2])

The new() method attempts to connect to an NNTP server. The $host argument is the DNS name or IP address of the server. If not specified, Net::NNTP looks for the server name in the NNTPSERVER and NEWSHOST environment variables first, and then in the Net::Config nntp_hosts key. If none of these variables is set, the Netnews host defaults to news.

In addition to the options accepted by IO::Socket::INET, Net::NNTP recognizes the name/value pairs shown in Table 8.2.

By default, when Net::NNTP connects to a server, it announces that it is a news reader rather than a news transport agent (a program chiefly responsible for bulk transfer of messages). If you want to act like a news transfer agent and really know what you're doing, provide new() with the option Reader=>0.

$success = $nntp->authinfo($user => $password)

Some NNTP servers require the user to log in before accessing any information. The authinfo() method takes a username and password, and returns true if the credentials were accepted.

$ok = $nntp->postok()

postok() returns true if the server allows posting of new articles. Even though the server as a whole may allow posting, individual moderated newsgroups may not.

$time = $nntp->date()

The date() method returns the time and date on the remote server, as the number of seconds since the epoch. You can convert this into a human-readable time-date string using the localtime() or gmtime() functions.

Table 8.2. Net::NNTP->new() Options

Option    Description                                      Default
Timeout   Seconds to wait for response from server         120
Debug     Turn on verbose debugging information            undef
Port      Numeric or symbolic name of port to connect to   119
Reader    Act like a news reader                           1

$nntp->slave()

$nntp->reader() [extension]

The slave() method puts the NNTP server into a mode in which it expects to engage in bulk transfer with the client. The reader() method engages a mode more suitable for the interactive transfer of individual articles. Unless explicitly disabled, reader() is issued automatically by the new() method.

$nntp->quit()

The quit() method cleans up and severs the connection with the server. This is also issued automatically when the NNTP object is destroyed.

Once created, you can query an NNTP object for information about newsgroups. The following methods deal with newsgroup-level functions.

$group_info = $nntp->list()

The list() method returns information about all active newsgroups. The return value is a hash reference in which each key is the name of a newsgroup, and each value is a reference to a three-element array that contains group information. The elements of the array are [$first,$last,$postok], where $first and $last are the message numbers of the first and last articles in the group, and $postok is "y" if posting is allowed to the group, "n" if it is not, or "m" if the group is moderated.
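One way to use this structure is sketched below. The helper name postable_groups() is hypothetical. It takes the hash reference returned by list() and estimates article counts from the first and last message numbers; the estimate is only an upper bound, because articles inside the range may already have expired or been cancelled.

```perl
use strict;
use warnings;

# Given the hash reference returned by list(), return a hash reference
# mapping each group that accepts direct posting to an estimated
# article count derived from its first and last message numbers.
sub postable_groups {
    my $group_info = shift;
    my %estimate;
    for my $name (keys %$group_info) {
        my ($first, $last, $postok) = @{ $group_info->{$name} };
        next unless $postok eq 'y';
        my $span = $last - $first + 1;
        $estimate{$name} = $span > 0 ? $span : 0;    # empty groups have last < first
    }
    return \%estimate;
}
```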

$group = $nntp->group([$group])

($articles,$first,$last,$name) = $nntp->group([$group])

The group() method gets or sets the current group. Called with a group name as its argument, it sets the current group used by the various article-retrieval methods.

Called without arguments, the method returns information about the current group. In a scalar context, the method returns the group name. In a list context, the method returns a four-element list that contains the number of articles in the group, the message numbers of the first and last articles, and the name of the group.

$group_info = $nntp->newgroups($since [,$distributions])

The newgroups() method works like list(), but returns only newsgroups that have been created more recently than the date specified in $since. The date must be expressed in seconds since the epoch as returned by time().

The $distributions argument, if provided, limits the returned list to those newsgroups that are restricted to the specified distribution(s). You may provide a single distribution name as a string, such as nj, or a reference to an array of distributions, such as ['nj','ct','ny'] for the New York tristate region.
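For example, the following sketch lists groups created within the last few days. The name recent_groups() is invented for illustration, and the code assumes newgroups() returns a hash reference keyed by group name, as described above.

```perl
use strict;
use warnings;

# Return the sorted names of newsgroups created in the last $days days.
# $distributions is optional: a string or an array reference.
# $nntp is expected to behave like a connected Net::NNTP object.
sub recent_groups {
    my ($nntp, $days, $distributions) = @_;
    my $since = time() - $days * 24 * 60 * 60;    # seconds since the epoch
    my $groups = defined $distributions
        ? $nntp->newgroups($since, $distributions)
        : $nntp->newgroups($since);
    return $groups ? (sort keys %$groups) : ();
}
```

A call such as recent_groups($nntp, 7, ['nj','ct','ny']) would then ask for the last week's new groups in the tristate distributions.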

$new_articles = $nntp->newnews($since [,$groups [,$distributions]])

The newnews() method returns a list of articles that have been posted since the time value indicated by $since. You may optionally provide a group pattern or a reference to an array of patterns in $groups, and a distribution pattern or reference to an array of distribution patterns in $distributions.

If successful, the method returns a reference to an array that contains the message IDs of all the matching articles. You may then use the article() and/or articlefh() methods described below to fetch the contents of the articles. This method is chiefly of use for mirroring an entire group or set of groups.

$group_info = $nntp->active([$pattern]) [extension]

The active() method works like list(), but limits retrieval to those newsgroups that match the wildcard pattern $pattern. If no pattern is specified, active() is functionally equivalent to list().

This method and the ones that follow all use common extensions to the NNTP protocol, and are not guaranteed to work with all NNTP servers.

$group_descriptions = $nntp->newsgroups([$pattern]) [extension]

$group_descriptions = $nntp->xgtitle($pattern) [extension]

The newsgroups() method takes a newsgroup wildcard pattern and returns a hash reference in which the keys are group names and the values are brief text descriptions of the group. Because many Netnews sites have given up on keeping track of all the newsgroups (which appear and disappear very dynamically), descriptions are not guaranteed to be available. In such cases, they appear as the string "No description", as "?", or simply as an empty string.

xgtitle() is another extension method that is functionally equivalent to newsgroups(), with the exception that the group pattern argument is required.

$group_times = $nntp->active_times()[extension]

This method returns a reference to a hash in which the keys are newsgroup names and the values are a reference to a two-element list giving the time the group was created and the ID of its creator. The creator ID may be something useful, like an e-mail address, but is more often something unhelpful, like "newsmaster."

$distributions = $nntp->distributions() [extension]

$subscriptions = $nntp->subscriptions() [extension]

These two methods return information about local server distribution and subscription lists. Local distributions can be used to control the propagation of messages in the local area network; for example, a company that is running multiple NNTP servers might define a distribution named engineering. Subscription lists are used to recommend lists of suggested newsgroups to new users of the system.

distributions() returns a hash reference in which the keys are distribution names and the values are human-readable descriptions of the distributions. subscriptions() returns a hash reference in which the keys are subscription list names and the values are array references containing the newsgroups that belong to the subscription list.

Once a group is selected using the group() method, you can list and retrieve articles. Net::NNTP gives you the option of retrieving a specific article by specifying its ID or message number, or iteratively fetching articles in sequence, starting at the current message number and working upward.

$article_arrayref = $nntp->article ([$message] [,FILEHANDLE])

The article() method retrieves the indicated article. If $message is numeric, it is interpreted as a message number in the current newsgroup. Net::NNTP returns the contents of the indicated message, and sets the current message pointer to this article. An absent first argument or a value of undef retrieves the current article.

If the first argument is not numeric, Net::NNTP treats it as the article's unique message ID. Net::NNTP retrieves the article, but does not change the position of the current message pointer. In fact, when referring to an article by its message ID, it is not necessary for the indicated article to belong to the current group.

The optional filehandle argument can be used to write the article to the specified destination. Otherwise, the article's contents (header, blank separating line, and body) are returned as a reference to an array containing the lines of the article.

Should something go wrong, article() returns undef and $nntp->message contains an error message from the server. A common error is "no such article number in this group", which can be issued even when the message number is in range because of articles that expire or are cancelled while the NNTP session is active.
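Because any individual article may have vanished, code that walks a range of message numbers should treat a failed fetch as routine. The sketch below shows one way to do that; the name fetch_range() is hypothetical, and the code assumes article() and message() behave as described above.

```perl
use strict;
use warnings;

# Fetch articles $first..$last of the current group. Returns two hash
# references: message number => article text, and message number =>
# server error text for the articles that could not be retrieved.
# $nntp is expected to behave like a connected Net::NNTP object.
sub fetch_range {
    my ($nntp, $first, $last) = @_;
    my (%articles, %errors);
    for my $num ($first .. $last) {
        if (my $lines = $nntp->article($num)) {
            $articles{$num} = join '', @$lines;
        }
        else {
            my $why = $nntp->message // 'unknown error';
            chomp $why;
            $errors{$num} = $why;    # typically an expired or cancelled article
        }
    }
    return (\%articles, \%errors);
}
```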

Other article-retrieval methods are more specialized.

$header_arrayref = $nntp->head([$message] [,FILEHANDLE])

$body_arrayref = $nntp->body([$message] [,FILEHANDLE])

The head() and body() methods work like article() but retrieve only the header or body of the article, respectively.

$fh = $nntp->articlefh([$message])

$fh = $nntp->headfh([$message])

$fh = $nntp->bodyfh([$message])

These three methods act like article(), head(), and body(), but return a tied filehandle from which the contents of the article can be retrieved. After using the filehandle, you should close it. For example, here is one way to read message 10000 of the current newsgroup:

$fh = $nntp->articlefh(10000) or die $nntp->message;
 while (<$fh>) {
       print;
 }
 close $fh;
 

$msgid = $nntp->next()

$msgid = $nntp->last()

$msgid = $nntp->nntpstat($message)

The next(), last(), and nntpstat() methods control the current article pointer. next() advances the current article pointer to the next article in the newsgroup, and last() moves the pointer to the previous entry. The nntpstat() method moves the current article pointer to the position indicated by $message, which should be a valid message number. After setting the current article pointer, all three methods return the message ID of the current article.

Net::NNTP allows you to post new articles using the post(), postfh(), and ihave() methods.

$success = $nntp->post([$message])

The post() method posts an article to Netnews. The posted article does not have to be directed to the current newsgroup; in fact, the news server ignores the current newsgroup when accepting an article and looks only at the contents of its Newsgroups: header. The article may be provided as an array containing the lines of the article or as a reference to such an array. Alternatively, you may call post() with no arguments and use the datasend() and dataend() methods inherited from Net::Cmd to send the article one line at a time.

If successful, post() returns a true value. Otherwise, it returns undef and $nntp->message contains an error message from the server.
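As a rough sketch of the array-based form of post(), the fragment below assembles a minimal article as a reference to an array of lines. The build_article() helper, the e-mail address, and the newsgroup name are our own illustrative assumptions, not part of Net::NNTP. The post() call itself is left commented out because it needs a live server that permits posting.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble header lines, the blank separator, and body lines.
sub build_article {
    my %arg  = @_;
    my @body = map { "$_\n" } split /\n/, $arg{body};
    return [
        "From: $arg{from}\n",
        "Newsgroups: $arg{newsgroups}\n",
        "Subject: $arg{subject}\n",
        "\n",
        @body,
    ];
}

my $article = build_article(
    from       => 'someone@example.com',
    newsgroups => 'misc.test',
    subject    => 'Test posting',
    body       => "This is a test.\nPlease ignore.",
);
print @$article;

# With a connected Net::NNTP object in $nntp:
# $nntp->post($article) or die $nntp->message;
```

Because the server looks only at the Newsgroups: header, the destination is set in the article itself rather than by a prior call to group().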

$fh = $nntp->postfh()

The postfh() method provides an alternative interface for posting an article. If the server allows posting, this method returns a tied filehandle to which you can print the contents of the article. After finishing, be sure to close the filehandle. The result code from close() indicates whether the article was accepted by the server.

$wants_it = $nntp->ihave($messageID[,$message])

The ihave() method is chiefly of use for clients that are acting as news relays. The method asks the Netnews server whether it wishes to accept the article whose ID is $messageID.

If the server indicates its assent, it returns a true result. The article must then be transferred to the server, either by providing the article's contents in the $message argument or by sending the article one line at a time using the Net::Cmd datasend() and dataend() methods. $message can be an array of article lines or a reference to such an array.

Last, several methods allow you to search for particular articles of interest.

$header_hash = $nntp->xhdr($header,$message_range) [extension]

$header_hash = $nntp->xpat($header,$pattern,$message_range) [extension]

$references = $nntp->xrover($message_range) [extension]

The xhdr() method is an extension function that allows you to retrieve the value of a header field from multiple articles. The $header argument is the name of an article header field, such as "Subject". $message_range is either a single message number or a reference to a two-element array containing the first and last messages in the desired range. If successful, xhdr() returns a hash reference in which the keys are the message numbers (not IDs) and the values are the requested header fields.

The header field is case-insensitive. However, not all headers can be retrieved in this way because NNTP servers typically index only that subset of the headers used to generate overview listings (see the next method).

The xpat() method is similar to xhdr(), but it filters the articles returned for those with $header fields that match the wildcard pattern in $pattern. The xrover() method returns the cross-reference fields for articles in the specified range. It is functionally identical to:

$xref = $nntp->xhdr('References',[$start,$end]);
 

The result of this call is a hash reference in which the keys are message numbers and the values are the message IDs that the article refers to. These are typically used to reconstruct discussion threads.

$overview_hash = $nntp->xover($message_range) [extension]

$format_arrayref = $nntp->overview_fmt() [extension]

The overview_fmt() and xover() methods return newsgroup "overview" information. The overview is a summary of selected article header fields; it typically contains the Subject: line, References:, article Date:, and article length. It is used by newsreaders to index, sort, and thread articles.

Pass the xover() method a message range (a single message number or a reference to an array containing the extremes of the range). If successful, the method's return value is a hash reference in which each key is a message number and each value is a reference to an array of the overview fields.

To discover what these fields are, call the overview_fmt() method. It returns an array reference containing field names in the order in which they appear in the arrays returned by xover(). Each field is followed by a colon and, occasionally, by a server-specific modifier. For example, my laboratory's Netnews server returns the following overview fields:

('Subject:','From:','Date:','Message-ID:','References:',
 'Bytes:','Lines:','Xref:full')
 

If you would prefer the values of the overview array to be a hash reference rather than an array reference, you can use the small subroutine shown here to do the transformation. The trick is to use the list of field names returned by overview_fmt() to create a hash slice to which we assign the article overview array:

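sub get_overview {
   my ($nntp,$range) = @_;
   # strip the ":" and any modifier, keeping names such as "Message-ID" whole
   my @fields = map {/^([^:]+):/ ? $1 : $_} @{$nntp->overview_fmt};
   my $over   = $nntp->xover($range) || return;
   foreach (keys %$over) {
     my $h = {};
     @{$h}{@fields}= @{$over->{$_}};   # hash slice: field names => overview values
     $over->{$_} = $h;
   }
   return $over;
 }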
 

Use the subroutine like this:

$over = get_overview($nntp,[30000,31000]);
 

The returned value will have a structure like this:

{
 30000 => {
        'Bytes' => 2704
        'Date' => 'Sat, 27 May 2000 19:35:10 GMT'
        'From' => 'mr_lowell@my-deja.com'
        'Lines' => 72
        'Message-ID' => '<8gp81d$cuo$1@nnrp1.deja.com>'
        'References' => ''
        'Subject' => 'mod_perl make test'
        'Xref' => 'Xref: rQdQ comp.lang.perl.modules:34162'
         },
  30001 => {
        'Bytes' => 1117
        'Date' => 'Sat, 27 May 2000 20:28:22 GMT'
        'From' => 'Robert Gasiorowski <gasior@snet.net>'
        'Lines' => 6
        'Message-ID' => '<39303E6A.88397549@snet.net>'
        'References' => ''
        'Subject' => 'installing module as non-root'
        'Xref' => 'Xref: rQdQ comp.lang.perl.modules:34163'
        },
      ....
 }
 

A News-to-Mail Gateway

The last code example of this chapter is a custom news-to-mail gateway. It periodically scans Netnews for articles of interest, bundles them into a MIME message, and mails them via Internet mail. Each time the script is run it keeps track of the messages it has previously sent and only sends messages that haven't been seen before.

You control the script's scope by specifying a list of newsgroups and, optionally, one or more patterns to search for in the subject lines of the articles contained in the newsgroups. If you don't specify any subject-line patterns, the script fetches the entire contents of the listed newsgroups.

The subject-line patterns take advantage of Perl's pattern-matching engine, and can be any regular expression. For performance reasons, however, we use the built-in NNTP wildcard patterns for newsgroup names.

The following command searches the comp.lang.perl.* newsgroups for articles that have the word "Socket" or "socket" in the subject line. Matching articles will be mailed to the local e-mail address lstein. Options include -subject, to specify the subject pattern match, -mail to set the mail recipient(s), and -v to turn on verbose progress messages.

% scan_newsgroups.pl -v -mail lstein -subj '[sS]ocket' 'comp.lang.perl.*'
 Searching comp.lang.perl.misc for matches
 Fetching overview for comp.lang.perl.misc
 found 39 matching articles
 Searching comp.lang.perl.announce for matches
 Fetching overview for comp.lang.perl.announce
 found 0 matching articles
 Searching comp.lang.perl.tk for matches
 Fetching overview for comp.lang.perl.tk
 found 1 matching articles
 Searching comp.lang.perl.modules for matches
 Fetching overview for comp.lang.perl.modules
 found 4 matching articles
 44 articles, 40 unseen
 sending e-mail message to lstein
 

The received e-mail message contains a brief prologue that describes the search and newsgroup patterns, followed by the matching articles. Each article is attached as an enclosure of MIME type message/rfc822. Depending on the reader's mail-reading software, the enclosures are displayed as either in-line components of the message or attachments. The result is particularly nice in the Netscape mail reader (Figure 8.8) because each article is displayed using fancy fonts and hyperlinks.

Figure 8.8. E-mail message sent from scan_newsgroups.pl

Figure 8.9 lists the code for scan_newsgroups.pl.

Figure 8.9. The scan_newsgroups.pl script

Lines 1-7: Load modules. We load the Net::NNTP and MIME::Entity modules, as well as the Getopt::Long module for argument processing. We need to keep track of all the messages that we have found during previous runs of the script, and the easiest way to do that is to keep the message IDs in an indexed DBM database. However, we don't know a priori what DBM library is available, so we import the AnyDBM_File module, which chooses a library for us. The code contained in the BEGIN{} block changes the DBM library search order, as described in the AnyDBM_File documentation.

We also load the Fcntl module in order to have access to several constants needed to initialize the DBM file.

Lines 9-22: Define constants. We choose a name for the DBM file, a file named .newscache in the user's home directory, and create a usage message.

Lines 23-25: Declare globals. The first line of globals corresponds to command-line options. The second line of globals contains various data structures manipulated by the script. The %Seen hash will be tied to the DBM file. Its keys are the message IDs of articles that we have previously retrieved. %Articles contains information about the articles recovered during the current search. Its keys are message IDs, and its values are hash references of header fields derived from the overview index. Last, @Fields contains the list of header fields returned by the xover() method.

Lines 26-34: Process command-line arguments. We call GetOptions() to process the command-line options, and then check consistency of the arguments. If the e-mail recipient isn't explicitly given on the command line, we default to the user's login name.

Lines 35-36: Open connection to Netnews server. We open a connection to the Netnews server by calling Net::NNTP->new(). If the server isn't explicitly given on the command line, the $SERVER option is undefined and Net::NNTP picks a suitable default.

Lines 37-39: Open DBM file. We tie %Seen to the .newscache file using the AnyDBM_File module. The options passed to tie() cause the file to be opened read/write and to be created with file mode 0640 (-rw-r-----), if it doesn't already exist.
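The tie() step might look roughly like the following sketch. It is not the listing from Figure 8.9: the open_cache() helper, the temporary directory, and the particular search order given in the BEGIN{} block are our own assumptions, chosen so that the example runs anywhere SDBM_File is available and so that every candidate library accepts the Fcntl flags.

```perl
use strict;
use warnings;
use Fcntl;                       # exports O_RDWR and O_CREAT by default
# Assumed search order; must be set before AnyDBM_File is loaded.
BEGIN { @AnyDBM_File::ISA = qw(DB_File NDBM_File SDBM_File) }
use AnyDBM_File;
use File::Temp qw(tempdir);

# Hypothetical helper: tie a hash to the DBM file at $path, creating it
# with mode 0640 if necessary, and return a reference to the tied hash.
sub open_cache {
    my $path = shift;
    my %seen;
    tie(%seen, 'AnyDBM_File', $path, O_RDWR|O_CREAT, 0640)
        or die "Can't tie $path: $!";
    return \%seen;
}

# A throwaway location; the script itself uses .newscache in the home directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/newscache";

my $seen = open_cache($path);
$seen->{'<8gp81d$cuo$1@nnrp1.deja.com>'} = 1;   # remember a message ID
untie %$seen;

# A later run sees the same ID because it was stored on disk.
$seen = open_cache($path);
print "already seen\n" if $seen->{'<8gp81d$cuo$1@nnrp1.deja.com>'};
untie %$seen;
```

Depending on the DBM library chosen, the cache may be stored as one file or as a pair of files whose names are derived from the path.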

Lines 40-41: Compile the pattern match. For efficiency's sake, we compile the pattern matches into an anonymous subroutine. This subroutine takes the text of a subject line and returns true if all the patterns match, and false otherwise. The match_code() subroutine takes the list of pattern matches, compiles them, and returns an appropriate code reference.

Lines 42-43: Expand newsgroup patterns. We pass the list of newsgroups to a subroutine named expand_newsgroups(). It calls the NNTP server to expand the wildcards in the list of newsgroups and returns the expanded list of newsgroup names.

Lines 44-45: Search for matching articles. We loop through the expanded list of newsgroups and call grep_group() for each one. The arguments to grep_group() consist of the newsgroup name and a code reference used to filter its articles. Internally, grep_group() accumulates the matched articles' message IDs into the %Articles hash. We do it this way because the same article may be cross-posted to several related newsgroups; using the article IDs in a hash avoids accumulating duplicates.

Lines 46-48: Filter out articles already seen. We use Perl's grep() function to filter out articles whose message IDs are already present in the tied %Seen hash. New article IDs are added to the hash so that on subsequent runs we will know that we've seen them. The unseen article IDs are assigned to the @to_fetch array.

If the user ran the script with the -all option, we short-circuit the grep() operation so that all articles are retrieved, including those we've seen before. This does not affect the updating of the tied %Seen hash.
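The filtering step can be sketched as follows. This is an illustration rather than the listing from Figure 8.9: the unseen_ids() helper, its $all flag (standing in for the -all option), and the sample message IDs are assumptions, and a plain hash takes the place of the tied %Seen hash.

```perl
use strict;
use warnings;

# Return the IDs not yet recorded in %$seen, or every ID if $all is true.
# Each ID is marked as seen either way, because the ++ runs before the || test.
sub unseen_ids {
    my ($seen, $all, @ids) = @_;
    return grep { !$seen->{$_}++ || $all } @ids;
}

my %seen     = ('<old@example.com>' => 1);   # stands in for the tied %Seen hash
my @to_fetch = unseen_ids(\%seen, 0, '<old@example.com>', '<new@example.com>');
print "to fetch: @to_fetch\n";
```

The post-increment returns the old count, so an ID passes the test the first time it is encountered and is rejected on every later run.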

Lines 49-52: Add articles to an outgoing mail message and quit. We pass the list of article IDs to send_mail(), which retrieves their contents and adds them to an outgoing mail message. We then call the NNTP object's quit() method to disconnect from the server, and exit ourselves.

Lines 53-62: The match_code() subroutine. The match_code() subroutine takes a list of zero or more patterns and constructs a code reference on the fly. The subroutine is built up line-by-line in a scalar variable called $code. The subroutine is designed to return true only if all the patterns match the passed subject line. If no patterns are specified, the subroutine returns true by default. If the -insensitive option was passed to the script, we do case-insensitive pattern matches with the i flag. Otherwise, we do case-sensitive matches.

After constructing the subroutine code, we eval() it and return the result to the caller. If the eval() fails (presumably because of an error in one or more of the regular expressions), we propagate the error message and die.
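A minimal sketch of this technique appears below. It is our own reconstruction from the description, not the code from Figure 8.9: the $insensitive argument stands in for the -insensitive option, and the sketch assumes that no pattern contains an unescaped "/".

```perl
use strict;
use warnings;

# Build a subroutine that returns true only if every pattern matches its argument.
# $insensitive is a hypothetical flag standing in for the -insensitive option.
# Assumes no pattern contains an unescaped "/" (the delimiter used below).
sub match_code {
    my ($insensitive, @patterns) = @_;
    my $flags = $insensitive ? 'i' : '';
    my $code  = "sub {\n  my \$subject = shift;\n";
    $code    .= "  return unless \$subject =~ /$_/$flags;\n" foreach @patterns;
    $code    .= "  return 1;\n}\n";
    my $sub = eval $code;               # compile the generated source
    die "Bad pattern: $@" if $@;        # propagate regular expression errors
    return $sub;
}

my $match = match_code(0, '[sS]ocket', 'perl');
print "match\n"    if $match->('perl socket question');
print "no match\n" unless $match->('sockets in C');
```

Because the patterns are pasted into source code and compiled once, each subject line is tested without recompiling the regular expressions. The same property means the patterns should come only from a trusted user.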

Lines 63-73: The expand_newsgroups() subroutine. The expand_newsgroups() subroutine takes a list of newsgroup patterns and calls the NNTP object's newsgroups() method on each of them in turn, expanding them to a list of valid newsgroup names. If a newsgroup contains no wildcards, we just pass it back unchanged.

Lines 74-85: The grep_group() subroutine. grep_group() scans the specified newsgroup for articles whose subject lines match a set of patterns. The patterns are provided in the form of a code reference that returns true if the subject line matches.

We call the get_overview() subroutine to return the server's overview index for the newsgroup. get_overview() returns a hash reference in which each key is a message number and each value is a hash of indexed header fields. We step through each message, recover its Subject: and Message-ID: fields, and pass the subject field to the pattern-matching code reference. If the code reference returns false, we go on to the next article. Otherwise, we add the article's message ID and overview data to the %Articles global.

When all articles have been examined, we return to the caller the number of those that matched.

Lines 89-102: The get_overview() subroutine. The get_overview() subroutine used here is a slight improvement over the version shown earlier. We start by calling the NNTP object's group() method, recovering the newsgroup's first and last message numbers. We then call the object's overview_fmt() method to retrieve the names of the fields in the overview index. Since this information isn't going to change during the lifetime of the script, however, we cache it in the @Fields global and call overview_fmt() only if the global is empty. Before assigning to @Fields, we clean up the field names by removing the ":" and anything following it.

We recover the overview for the entire newsgroup by calling the xover() method for the range spanning the first and last article numbers. We now loop through the keys of the returned overview hash, replacing its array reference values, which list fields by position, with anonymous hashes that list fields by name. In addition to recording the header fields that occur in the article itself, we record a pseudofield named Message-Number: that contains the group name and message number in the form group.name:number. We use this information during e-mail construction to create the default name for the article enclosure.

Lines 103-124: The send_mail() subroutine. send_mail() is called with an array of article IDs to fetch, and is responsible for constructing a multipart MIME message containing each article as an attachment.

We create a short message prologue that summarizes the program's run-time options and create a new MIME::Entity by calling the build() method. The message starts as a single-part message of type text/plain, but is automatically promoted to a multipart message as soon as we start attaching articles to it.

We then call attach_article() for each article listed in $to_fetch. This array may be empty, in which case we make no attachments. When all articles have been attached, we call the MIME entity's smtpsend() method to send out the mail using the Mail::Mailer SMTP method, and clean up any temporary files by calling the entity's purge() method.

Lines 125-134: The attach_article() subroutine. For the indicated message ID we fetch the entire article's contents as an array of lines by calling the NNTP object's article() method. We then attach the article to the outgoing mail message, specifying a MIME type of message/rfc822, a description corresponding to the article's subject line, and a suggested filename derived from the article's newsgroup and message number (taken from the global %Articles hash).

An interesting feature of this script is the fact that because we are storing unique global message IDs in the .newscache hashed database, we can switch to a different NNTP server without worrying about retrieving articles we have already seen.

Chapter 9. Web Clients

In the previous chapters we reviewed client modules for sending and receiving Internet mail, transferring files via FTP, and interacting with Netnews servers. In this chapter we look at LWP, the Library for Web access in Perl. LWP provides a unified API for interacting with Web, FTP, News and Mail servers, as well as with more obscure services such as Gopher.

With LWP you can (1) request a document from a remote Web server using its URL; (2) POST data to a Web server, emulating the submission of a fill-out form; (3) mirror a document on a remote Web server in such a way that the document is transferred only if it is more recent than the local copy; (4) parse HTML documents to recover links and other interesting features; (5) format HTML documents as text and postscript; and (6) handle cookies, HTTP redirects, proxy servers, and HTTP user authentication. Indeed, LWP implements all the functionality one needs to write a Web browser in Perl, and if you download and install the Perl-TK distribution, you'll find it contains a fully functional graphical Web browser written on top of LWP.

The base LWP distribution contains 35 modules, and another dozen modules are required for HTML parsing and formatting. Because of its size and scope, we will skim the surface of LWP. For an exhaustive treatment, see LWP's POD documentation, or the excellent, but now somewhat dated Web Client Programming with Perl [Wong 1999].

Installing LWP

The first version of LWP appeared in 1995, and was written by Martijn Koster and Gisle Aas. It has since been maintained and extended by Gisle Aas, with help from many contributors.

The basic LWP library, distributed via CPAN in the file libwww-perl-X.XX.tar.gz (where X.XX is the most recent version number), provides support for the HTTP, FTP, Gopher, SMTP, NNTP, and HTTPS (HTTP over Secure Sockets Layer) protocols. However, before you can install it, you must install a number of prerequisite modules:

URI

URL parsing and manipulation

Net::FTP

to support ftp://URLs

MIME::Base64

to support HTTP Basic authentication

Digest::MD5

to support HTTP Digest authentication

HTML::HeadParser

for finding the <BASE> tag in HTML headers

You could download and install each of these modules separately, but the easiest way is to install LWP and all its prerequisites in batch mode using the standard CPAN module. Here is how to do this from the command line:

% perl -MCPAN -e 'install Bundle::LWP'
 					

This loads the CPAN module and then calls the install() function to download, build, and install LWP and all the ancillary modules that it needs to run.

The HTML-parsing and HTML-formatting modules were once bundled with LWP, but are now distributed as separate packages named HTML-Parser and HTML-Formatter, respectively. They each have a number of prerequisites, and again, the easiest way to install them is via the CPAN module using this command:

% perl -MCPAN -e 'install HTML::Parser;' -e 'install HTML::Formatter'
 					

If you want to install these libraries manually, here is the list of the packages that you need to download and install:

HTML-Parser

HTML parsing

HTML-Tree

HTML syntax-tree generation

Font-AFM

Postscript font metrics

HTML-Format

HTML formatting

To use the HTTPS (secure HTTP) protocol, you must install one of the Perl SSL modules, IO::Socket::SSL, as well as OpenSSL, the open source SSL library that IO::Socket::SSL depends on. OpenSSL is available from http://www.openssl.org/.

LWP is pure Perl. You don't need a C compiler to install it. In addition to the module files, when you install LWP you get four scripts, which serve as examples of how to use the library, as well as useful utilities in their own right. The scripts are:

  • lwp-request Fetch a URL and display it.

  • lwp-download Download a document to disk, suitable for files too large to hold in memory.

  • lwp-mirror Mirror a document on a remote server, updating the local copy only if the remote one is more recent.

  • lwp-rget Copy an entire document hierarchy recursively.

LWP Basics

Figure 9.1 shows a script that downloads the URL given on the command line. If successful, the document is printed to standard output. Otherwise, the script dies with an appropriate error message. For example, to download the HTML source for Yahoo's weather page, located at http://www.yahoo.com/r/wt, you would call the script like this:

Figure 9.1. Fetch a URL using LWP's object-oriented interface

% get_url.pl http://www.yahoo.com/r/wt > weather.html
 					

The script can just as easily be used to download a file from an FTP server like this:

% get_url.pl ftp://www.cpan.org/CPAN/RECENT
 					

The script will even fetch news articles, provided you know the message ID:

% get_url.pl news:3965e1e8.1936939@enews.newsguy.com
 					

All this functionality is contained in a script just 10 lines long.

Lines 1-3: Load modules. We turn on strict syntax checking and load the LWP module.

Line 4: Read URL. We read the desired URL from the command line.

Line 5: Create an LWP::UserAgent. We create a new LWP::UserAgent object by calling its new() method. The user agent knows how to make requests on remote servers and return their responses.

Line 6: Create a new HTTP::Request. We call HTTP::Request->new(), passing it a request method of "GET" and the desired URL. This returns a new HTTP::Request object.

Line 7: Make the request. We pass the newly created HTTP::Request to the user agent's request() method. This issues a request on the remote server, returning an HTTP::Response.

Lines 8-9: Print response. We call the response object's is_success() method to determine whether the request was successful. If not, we die with the server's error message, returned by the response object's message() method. Otherwise, we retrieve and print the response contents by calling the response object's content() method.
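The steps just described can be sketched as follows. This is a reconstruction from the walkthrough, not the original listing, so its line numbers do not correspond to those above. Wrapping the logic in a fetch_url() subroutine and skipping the fetch when no URL is given are our own choices, made so that the fragment can be reused and loaded without touching the network.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP;

# Fetch $url and return its content, or die with the server's status message.
sub fetch_url {
    my $url      = shift;
    my $agent    = LWP::UserAgent->new;                 # the user agent
    my $request  = HTTP::Request->new(GET => $url);     # the outgoing request
    my $response = $agent->request($request);           # an HTTP::Response
    $response->is_success or die "$url: ", $response->message, "\n";
    return $response->content;
}

# Read the URL from the command line and print the document.
print fetch_url($ARGV[0]) if @ARGV;
```

Since the user agent dispatches on the URL scheme, the same subroutine should handle http:, ftp:, news:, and file: URLs, provided the corresponding protocol modules are installed.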

Short as it is, this script illustrates the major components of the LWP library. HTTP::Request contains information about the outgoing request from the client to the server. Requests can be simple objects containing little more than a URL, as shown here, or can be complex objects containing cookies, authentication information, and arguments to be passed to server scripts.

HTTP::Response encapsulates the information returned from the server to the client. Response objects contain status information, plus the document contents itself.

LWP::UserAgent intermediates between client and server, transmitting HTTP::Requests to the remote server, and translating the server's response into an HTTP::Response to return to client code.

In addition to its object-oriented mode, LWP offers a simplified procedural interface called LWP::Simple. Figure 9.2 shows the same script rewritten using this module. After loading the LWP::Simple module, we fetch the desired URL from the command line and pass it to getprint(). This function attempts to retrieve the indicated URL. If successful, it prints its content to standard output. Otherwise, it prints a message describing the error to STDERR.

Figure 9.2. Fetch a URL using LWP::Simple procedural interface
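Based on the description above, the procedural version might look roughly like this sketch. It is a reconstruction rather than the original figure, and the check that a URL was actually supplied is our own addition.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# Read the URL from the command line.
my $url = shift;

# Print the document to standard output, or an error message to STDERR.
getprint($url) if defined $url;
```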

In fact, we could reduce Figure 9.1 even further to this one-line command:

% perl -MLWP::Simple -e 'getprint shift' http://www.yahoo.com/r/wt
 					

The procedural interface is suitable for fetching and mirroring Web documents when you do not need control over the outgoing request and you do not wish to examine the response in detail. The object-oriented interface is there when you need to customize the outgoing request by providing authentication information and data to post to a server script, or by changing other header information passed to the server. The object-oriented interface also allows you to interrogate the response to recover detailed information about the remote server and the returned document.

HTTP::Request

The Web paradigm generalizes all client/server interactions to a client request and a server response. The client request consists of a Uniform Resource Locator (URL) and a request method. The URL, which is known in the LWP documentation by its more general name, URI (for Uniform Resource Identifier), contains information on the network protocol to use and the server to contact. Each protocol uses different conventions in its URLs. The protocols supported by LWP include:

HTTP The Hypertext Transfer Protocol, the "native" Web protocol described in RFCs 1945 and 2616, and the one used by all Web servers. HTTP URLs have this familiar form:

http://server.name:port/path/to/document
 

The http: at the beginning identifies the protocol. This is followed by the server's DNS name or IP address and, optionally, the port the server is listening on. The remainder of the URL is the path to the document.

FTP A document stored on an FTP server. FTP URLs have this form:

ftp://server.name:port/path/to/document
 

GOPHER A document stored on a server running the now rarely used gopher protocol. Gopher URLs have this form:

gopher://server.name:port/path/to/document
 

SMTP LWP can send mail messages via SMTP servers using mailto: URLs. These have the form:

mailto:user@some.host
 

where user@some.host is the recipient's e-mail address. Notice that the location of the SMTP server isn't part of the URL. LWP uses local configuration information to identify the server.

NNTP LWP can retrieve a news posting from an NNTP server given the ID of the message you wish to retrieve. The URL format is:

news:message-id
 

As in mailto: URLs, there is no way to specify the particular NNTP server. A suitable server is identified automatically using Net::NNTP's rules (see Chapter 8).

In addition to the URL, each request has a method. The request method indicates the type of transaction that is requested. A number of methods are defined, but the most frequent ones are:

GET Fetch a copy of the document indicated by the URL. This is the most common way of fetching a Web page.

PUT Replace or create the document indicated by the URL with the document contained in the request. This is most commonly seen in the FTP protocol when uploading a file, but is also used by some Web page editors.

POST Send some information to the indicated URL. It was designed for posting e-mail messages and news articles, but was long ago appropriated for use in sending fill-out forms to CGI scripts and other server-side programs.

DELETE Delete the document indicated by the URL. This is used to delete files from FTP servers and by some Web-based editing systems.

HEAD Return information about the indicated document without changing or downloading it.

HTTP protocol requests can also contain other information. Each request includes a header that contains a set of RFC 822-like fields. Common fields include Accept:, indicating the MIME type(s) the client is prepared to receive, User-agent:, containing the name and version of the client software, and Content-type:, which describes the MIME type of the request content, if any. Other fields handle user authentication for password-protected URLs.

For the PUT and POST methods, but not for GET, HEAD, and DELETE, the request also contains content data. For PUT, the content is the document to upload to the location indicated by the URL. For POST, the content is some data to send, such as the contents of a fill-out form to send to a CGI script.

The LWP library uses a class named HTTP::Request to represent all requests, even those that do not use the HTTP protocol. You construct a request by calling HTTP::Request->new() with the name of the desired request method and the URL you wish to apply the request to. For HTTP requests, you can then add or alter the outgoing headers to do such things as add authentication information or HTTP cookies. If the request method expects content data, you'll normally add the data to the request object using its content() method.

The API description that follows lists the most frequently used HTTP::Request methods. Some of them are defined in HTTP::Request directly, and others are inherited.

One begins by creating a new request object with HTTP::Request->new().

$request = HTTP::Request->new($method, $url [,$header [,$content]])

The new() method constructs a new HTTP::Request. It takes a minimum of two arguments. $method is the name of the request method, such as GET, and $url is the URL to act on. The URL can be a simple string or a reference to a URI object created using the URI module. We will not discuss the URI module in detail here, but it provides functionality for dissecting out the various parts of URLs.

new() also accepts optional header and content arguments. $header should be a reference to an HTTP::Headers object. However, we will not go over the HTTP::Headers API because it's easier to allow HTTP::Request to create a default headers object and then customize it after the object is created. $content is a string containing whatever content you wish to send to the server.
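As a brief illustration, the following sketch builds the same GET request twice, once from a string and once from a URI object, and uses the URI module to dissect the URL. The host www.example.com and the path are placeholders, not a real service; nothing is sent over the network.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request;

# Dissect a URL into its parts with the URI module.
my $uri = URI->new('http://www.example.com:8080/path/to/document');
printf "scheme=%s host=%s port=%s path=%s\n",
    $uri->scheme, $uri->host, $uri->port, $uri->path;

# The URL argument may be a plain string or a URI object.
my $from_string = HTTP::Request->new(GET => 'http://www.example.com:8080/path/to/document');
my $from_object = HTTP::Request->new(GET => $uri);

print $from_object->method, ' ', $from_object->uri, "\n";
```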

Once the request object is created, the header() method can be used to examine or change header fields.

$request->header($field1 => $val1, $field2 => $val2 ...)

@values = $request->header($field)

Call header() with one or more field/value pairs to set the indicated fields, or with a single field name to retrieve the current values. When called with a field name, header() returns the current value of the field. In a list context, header() returns multivalued fields as a list; in a scalar context, it returns the values separated by commas.

This example sets the Referer: field, which indicates the URL of the document that referred to the one currently being requested:

$request->header(Referer => 'http://www.yahoo.com/whats_cool.html')
 

An HTTP header field can be multivalued. For example, a client may have a Cookie: field for each cookie assigned to it by the server. You can set multivalued field values by using an array reference as the value, or by passing a string in which values are separated by commas. This example sets the Accept: field, which is a multivalued list of the MIME types that the client is willing to accept:

$request->header(Accept => ['text/html','text/plain','text/rtf'])
 

Alternatively, you can use the push_header() method, described next, to set multivalued fields.

$request->push_header($field => $value)

The push_header() method appends the indicated value to the end of the field, creating it if it does not already exist, and making it multivalued otherwise. $value can be a scalar or an array reference.

$request->remove_header(@fields)

The remove_header() method deletes the indicated fields.
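The header methods described so far can be tried without contacting a server. The following is a minimal sketch, assuming a placeholder host (www.example.com) that is not part of the original text; it builds a request, sets single-valued and multivalued fields, and reads them back in both list and scalar context.

```perl
use strict;
use warnings;
use HTTP::Request;

# Build a GET request; HTTP::Request creates a default headers object for us.
# The example.com URLs are placeholders chosen for illustration.
my $request = HTTP::Request->new(GET => 'http://www.example.com/index.html');

# Set a single-valued field and a multivalued field.
$request->header(Referer => 'http://www.example.com/start.html');
$request->header(Accept  => ['text/html', 'text/plain']);

# Append one more value to the multivalued field.
$request->push_header(Accept => 'text/rtf');

# List context returns each value; scalar context joins them with commas.
my @types = $request->header('Accept');
my $types = $request->header('Accept');

# Delete a field we no longer want to send.
$request->remove_header('Referer');

print "Accept values: @types\n";
print "Accept string: $types\n";
```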

A variety of methods provide shortcuts for dealing with header fields.

$request->scan(\&sub)

The scan() method iterates over each of the HTTP headers in turn, invoking the code reference provided in \&sub. The subroutine you provide will be called with two arguments consisting of the field name and its value. For multivalued fields, the subroutine is invoked once for each value.
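As an illustration of scan(), here is a small sketch that collects every header into a list of "Field: value" strings. The field values and the example.com URL are made up for the example.

```perl
use strict;
use warnings;
use HTTP::Request;

my $request = HTTP::Request->new(GET => 'http://www.example.com/');
$request->header(Accept     => ['text/html', 'text/plain']);
$request->header(User_Agent => 'example/1.0');

# The callback receives the field name and one value per call,
# so the two-valued Accept field triggers two calls.
my @lines;
$request->scan(sub {
    my ($field, $value) = @_;
    push @lines, "$field: $value";
});

print "$_\n" for @lines;
```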

$request->date()

$request->expires()

$request->last_modified()

$request->if_modified_since()

$request->content_type()

$request->content_length()

$request->referer()

$request->user_agent()

These methods belong to a family of 19 convenience methods that allow you to get and set a number of common unique-valued fields. Called without an argument, they return the current value of the field. Called with a single argument, they set it. The methods that deal with dates use system time format, as returned by time().
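A brief sketch of the convenience methods follows. It shows that the date methods accept and return values in time() format, while the underlying header holds a formatted date string. The URL and agent name are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request;

my $request = HTTP::Request->new(GET => 'http://www.example.com/');

# Single-valued fields: call with an argument to set, without one to get.
$request->user_agent('example/1.0');
$request->referer('http://www.example.com/start.html');

# Date methods take seconds since the epoch, as returned by time().
my $yesterday = time() - 24 * 60 * 60;
$request->if_modified_since($yesterday);

# Reading the value back returns epoch seconds again.
my $when = $request->if_modified_since;

print "Epoch value:  $when\n";
print "Header value: ", $request->header('If-Modified-Since'), "\n";
```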

Three methods allow you to set and examine one request's content.

$request->content([$content])

$request->content_ref

The content() method sets the content of the outgoing request. If no argument is provided, it returns the current content value, if any. content_ref() returns a reference to the content, and can be used to manipulate the content directly.

When POSTing a fill-out form query to a dynamic Web page, you use content() to set the query string, and call content_type() to set the MIME type to either application/x-www-form-urlencoded or multipart/form-data.

It is also possible to generate content dynamically by passing content() a reference to a piece of code that returns the content. LWP invokes the subroutine repeatedly until it returns an empty string. This facility is useful for PUT requests to FTP servers, and POST requests to mail and news servers. However, it's inconvenient to use with HTTP servers because the Content-Length: field must be filled out before sending the request. If you know the length of the dynamically generated content in advance, you can set it using the content_length() method.

$request->add_content($data)

This method appends some data to the end of the existing content, if any. It is useful when reading content from a file.
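The content methods can be combined as in the sketch below, which assembles the body of a form POST in two pieces, edits it in place through the reference returned by content_ref(), and then records its length. The target URL is a placeholder, not a real service.

```perl
use strict;
use warnings;
use HTTP::Request;

# Placeholder URL used only for illustration.
my $request = HTTP::Request->new(POST => 'http://www.example.com/cgi-bin/search');
$request->content_type('application/x-www-form-urlencoded');

# Set the initial content, then append to it.
$request->content('query=MIME%20types');
$request->add_content('&archive=rfcindex');

# content_ref() gives direct access to the stored string.
my $ref = $request->content_ref;
$$ref =~ s/rfcindex/rfcs/;

# Fill in Content-Length: once the content is final.
$request->content_length(length $request->content);

print $request->content, "\n";
```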

Finally, several methods allow you to change the URL and method.

$request->uri([$uri])

This method gets or sets the outgoing request's URI.

$request->method([$method])

The method() method gets or sets the outgoing request's method.

$string = $request->as_string

The as_string() method returns the outgoing request as a string, often used during debugging.
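For instance, the following sketch changes both the method and the URL of an existing request and then dumps it for inspection. The URLs are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request;

my $request = HTTP::Request->new(GET => 'http://www.example.com/old.html');

# Turn the GET into a HEAD on a different document.
$request->method('HEAD');
$request->uri('http://www.example.com/new.html');

# Handy while debugging: show the request line and headers as text.
print $request->as_string;
```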

HTTP::Response

Once a request is issued, LWP returns the server's response in the form of an HTTP::Response object. HTTP::Response is used even for non-HTTP protocols, such as FTP.

HTTP::Response objects contain status information that reports the outcome of the request, and header information that provides meta-information about the transaction and the requested document. For GET and POST requests, the HTTP::Response usually contains content data.

The status information is available both as a numeric status code and as a short human-readable message. When using the HTTP protocol, there are more than a dozen status codes, the most common of which are listed in Table 9.1. Although the text of the messages varies slightly from server to server, the codes are standardized and fall into the following general categories:

  • Informational codes, in the range 100 through 199, are issued before the request is complete.

  • Success codes, which occupy the 200 through 299 range, indicate successful outcomes.

  • Redirection status codes, in the 300 through 399 range, indicate that the requested URL has moved elsewhere. These are commonly encountered when a Web site has been reorganized and the administrators have installed redirects to avoid breaking incoming external links.

  • Errors in the 400 through 499 range indicate various client-side errors, and those 500 and up are server-side errors.

When dealing with non-HTTP servers, LWP synthesizes appropriate status codes. For example, when requesting a file from an FTP server, LWP generates a 200 ("OK") response if the file was downloaded, and 404 ("Not Found") if the requested file does not exist.

The LWP library handles some status codes automatically. For example, if a Web server returns a redirection response indicating that the requested URL can be found at a different location (codes 301 or 302), LWP automatically generates a new request directed at the indicated location. The response that you receive corresponds to the new request, not the original. If the response requests authorization (status code 401), and authorization information is available, LWP reissues the request with the appropriate authorization headers.

HTTP::Response headers describe the server, the transaction, and the enclosed content. The most useful headers include Content-type: and Content-length:, which provide the MIME type and length of the returned document, if any, Last-modified:, which indicates when the document was last modified, and Date:, which tells you the server's idea of the time (since client and server clocks are not necessarily synchronized).

Table 9.1. Common HTTP Status Codes and Messages
Code  Message                 Description
1XX codes: informational
100   Continue                Continue with request.
101   Switching Protocols     The server is switching protocols, for example to a newer version of HTTP.
2XX codes: success
200   OK                      The URL was found. Its contents follow.
201   Created                 A URL was created in response to a POST.
202   Accepted                The request was accepted for processing at a later date.
204   No Content              The request was successful, but there is no content.
3XX codes: redirection
301   Moved Permanently       The URL has permanently moved to a new location.
302   Found                   The URL can be temporarily found at a new location.
4XX codes: client errors
400   Bad Request             There's a syntax error in the request.
401   Authorization Required  Password authorization is required.
403   Forbidden               This URL is forbidden, and authorization won't help.
404   Not Found               It isn't here.
5XX codes: server errors
500   Internal Server Error   The server encountered an unexpected error.
501   Not Implemented         Used for unimplemented features.
503   Service Unavailable     The server is temporarily overloaded.

Like the request object, HTTP::Response inherits from HTTP::Message, and delegates unknown method calls to the HTTP::Headers object contained within it. To access header fields, you can call header(), content_type(), expires(), and all the other header-manipulation methods described earlier.

Similarly, the response content can be accessed using the content() and content_ref() methods. Because some documents can be quite large, LWP also provides methods for saving the content directly to disk files and spooling them to subroutines in pieces.

Although HTTP::Response has a constructor, you will not usually construct it yourself, so it isn't listed here. For brevity, a number of other infrequently used methods are also omitted. See the HTTP::Response documentation for the full API.

$status_code = $response->code

$status_message = $response->message

The code() and message() methods return information about the outcome of the request. code() returns a numeric status code, and message() returns its human-readable equivalent. You can also provide these methods with an argument in order to set the corresponding field.

$text = $response->status_line

The status_line() method returns the status code followed by the message in the same format returned by the Web server.

$boolean = $response->is_success

$boolean = $response->is_redirect

$boolean = $response->is_info

$boolean = $response->is_error

These four methods return true if the response was successful, is a redirection, is informational, or is an error, respectively.

$html = $response->error_as_HTML

If is_error() returns true, you can call error_as_HTML() to return a nicely formatted HTML document describing the error.
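To experiment with the status methods without a network connection, one can build a response by hand. This is only a sketch for illustration; in normal use the response comes back from the user agent rather than from the constructor.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response standing in for what a server might return.
my $response = HTTP::Response->new(404, 'Not Found');

print "Code:    ", $response->code,        "\n";
print "Message: ", $response->message,     "\n";
print "Status:  ", $response->status_line, "\n";

# Classify the outcome before deciding what to do with it.
my $html = '';
if ($response->is_success) {
    print "The request succeeded\n";
} elsif ($response->is_error) {
    # An HTML page describing the error, suitable for showing to a user.
    $html = $response->error_as_HTML;
    print $html;
}
```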

$base = $response->base

The base() method returns the base URL for the response. This is the URL to use to resolve relative links contained in the returned document. The value returned by base() is actually a URI object, and can be used to "absolutize" relative URLs. See the URI module documentation for details.

$request = $response->request

The request() method returns a copy of the HTTP::Request object that generated this response. This may not be the same HTTP::Request that you constructed. If the server generated a redirect or authentication request, then the request returned by this method is the object generated internally by LWP.

$previous_response = $response->previous

previous() returns the HTTP::Response object that preceded the current one. This can be used to follow a chain of redirects back to the response to the original request. If there is no previous response, this method returns undef.

Figure 9.3 shows a simple script named follow_chain.pl that uses the previous() method to show all the intermediate redirects between the requested URL and the retrieved URL. It begins just like the get_url.pl script of Figure 9.1, but uses the HEAD method to retrieve information about the URL without fetching its content. After retrieving the HTTP::Response, we call previous() repeatedly to retrieve all intermediate responses. Each response's URL and status line are prepended to a growing list, forming a response chain. At the end, we format the response chain a bit and print it out.

Figure 9.3. The follow_chain.pl script tracks redirects

graphics/09fig03.gif

Here is the result of fetching a URL that has been moved around a bit in successive reorganizations of my laboratory's Web site:

% follow_chain.pl http://stein.cshl.org/software/WWW
 Response chain:
        http://stein.cshl.org/software/WWW (302 Found)
        -> http://stein.cshl.org/software/WWW/ (301 Moved Permanently)
        -> http://stein.cshl.org/WWW/software/ (200 OK)
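The chain-walking step can be sketched as follows. This is not the listing from Figure 9.3; it is a small offline approximation in which a redirect chain is faked by hand (the fake_response() helper and the example.com URLs are hypothetical) so that the loop over previous() can be run without a network connection.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;

# Hypothetical helper: build a response and link it to its predecessor.
sub fake_response {
    my ($code, $message, $url, $previous) = @_;
    my $r = HTTP::Response->new($code, $message);
    $r->request(HTTP::Request->new(HEAD => $url));
    $r->previous($previous) if $previous;
    return $r;
}

# Walk backward from the final response, prepending each step to the list.
sub response_chain {
    my $response = shift;
    my @chain;
    for (my $r = $response; $r; $r = $r->previous) {
        unshift @chain, sprintf('%s (%s)', $r->request->uri, $r->status_line);
    }
    return @chain;
}

my $first  = fake_response(302, 'Found', 'http://www.example.com/a');
my $second = fake_response(301, 'Moved Permanently', 'http://www.example.com/a/', $first);
my $last   = fake_response(200, 'OK', 'http://www.example.com/b/', $second);

my @chain = response_chain($last);
print "Response chain:\n\t", join("\n\t-> ", @chain), "\n";
```

With a real user agent, $last would simply be the value returned by request() for a HEAD request.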
 

LWP::UserAgent

The LWP::UserAgent class is responsible for submitting HTTP::Request objects to remote servers, and encapsulating the response in a suitable HTTP::Response. It is, in effect, a Web browser engine.

In addition to retrieving remote documents, LWP::UserAgent knows how to mirror them so that the remote document is transferred only if the local copy is not as recent. It handles Web pages that require password authentication, stores and returns HTTP cookies, and knows how to negotiate HTTP proxy servers and redirect responses.

Unlike HTTP::Response and HTTP::Request, LWP::UserAgent is frequently subclassed to customize the way that it interacts with the remote server. We will see examples of this in a later section.

$agent = LWP::UserAgent->new

The new() method constructs a new LWP::UserAgent object. It takes no arguments. You can reuse one user agent multiple times to fetch URLs.

$response = $agent->request ($request, [$dest [,$size]])

The request() method issues the provided HTTP::Request, returning an HTTP::Response. A response is returned even on failed requests. You should call the response's is_success() or code() methods to determine the exact outcome.

The optional $dest argument controls where the response content goes. If it is omitted, the content is placed in the response object, where it can be recovered with the content() and content_ref() methods.

If $dest is a scalar, it is treated as a filename. The file is opened for writing, and the retrieved document is stored to it. Because LWP prepends a > symbol to the filename, you cannot use command pipes or other tricks. Because the content is stored to the file, the response object indicates successful completion of the task, but content() returns undef.

$dest can also be a reference to a callback subroutine. In this case, the content data is passed to the indicated subroutine at regular intervals, giving you a chance to do something with the data, like pass it to an HTML parser. The callback subroutine should look something like this:

sub handle_content {
   my ($data,$response,$protocol) = @_;
   ...
 }
 

The three arguments passed to the callback are the current chunk of content data, the current HTTP::Response object, and an LWP::Protocol object. The response object is provided so that the subroutine can make intelligent decisions about how to process the content, such as piping data of type image/jpeg to an image viewer. The LWP::Protocol object implements protocol-specific access methods that are used by LWP internally. It is unlikely that you will need it.

If you use a code reference for $dest, you can exercise some control over the content chunk size by providing a $size argument. For example, if you pass 512 for $size, the callback will be called repeatedly with 512-byte chunks of the content data.
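As a rough sketch of the callback form, the script below counts the bytes and chunks delivered to the callback instead of storing the content. It contacts the network only when a URL is supplied on the command line; the variable names are illustrative.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my ($chunks, $bytes) = (0, 0);

# Called by LWP for each chunk of content as it arrives.
sub handle_content {
    my ($data, $response, $protocol) = @_;
    $chunks++;
    $bytes += length $data;
}

# Only touch the network when a URL is given on the command line.
if (@ARGV) {
    my $agent   = LWP::UserAgent->new;
    my $request = HTTP::Request->new(GET => $ARGV[0]);

    # Pass the callback as $dest and ask for 512-byte chunks.
    my $response = $agent->request($request, \&handle_content, 512);

    if ($response->is_success) {
        print "Received $bytes bytes in $chunks chunks\n";
    } else {
        warn "Fetch failed: ", $response->status_line, "\n";
    }
}
```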

Two variants of request() are useful in certain situations.

$response = $agent->simple_request($request, [$dest [,$size]])

simple_request() behaves like request(), but does not automatically reissue requests to handle redirects or authentication requirements. Its arguments are identical to those of request().

$response = $agent->mirror($url,$file)

The mirror() method accepts a URL (a URI object or a string) and the path to a file in which to store the remote document. If the local file doesn't already exist, then mirror() fetches the remote document. Otherwise, mirror() compares the modification dates of the remote and local copies, and only fetches the document if the local copy appears to be out of date. For HTTP URLs, mirror() constructs an HTTP::Request object that has the correct If-Modified-Since: header field to perform a conditional fetch. For FTP URLs, LWP uses the MDTM (modification time) command to fetch the modification date of the remote file.

Two methods allow you to set time and space limits on requests.

$timeout = $agent->timeout([$timeout])

timeout() gets or sets the timeout on requests, in seconds. The default is 180 seconds (3 minutes). If the timeout expires before the request completes, the returned response has a status code of 500, and a message indicating that the request timed out.

$bytes = $agent->max_size([$bytes])

The max_size() method gets or sets a maximum size on the response content returned by the remote server. If the content exceeds this size, then the content is truncated and the response object contains an X-Content-Range: header indicating the portion of the document returned. Typically, this header has the format bytes start-end, where start and end are the start and endpoints of the document portion.

By default, the size is undef, meaning that the user agent will accept content of any length.
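Both limits are ordinary get/set accessors, as this short sketch shows. The particular values (30 seconds, 64 KB) are arbitrary choices for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $agent = LWP::UserAgent->new;

# Give up on requests that take longer than 30 seconds.
$agent->timeout(30);

# Accept at most 64 KB of content per response.
$agent->max_size(64 * 1024);

print "Timeout:  ", $agent->timeout,  " seconds\n";
print "Max size: ", $agent->max_size, " bytes\n";
```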

The agent() and from() methods add information to the request.

$id = $agent->agent([$id])

The agent() method gets or sets the User-Agent: field that LWP will send to HTTP servers. It has the form name/x.xx (comment), where name is the client software name, x.xx is the version number, and (comment) is an optional comment field. By default, LWP uses libwww-perl/x.xx, where x.xx is the current module version number.

You may need to change the agent ID to trigger browser-specific behavior in the remote server. For example, this line of code changes the agent ID to Mozilla/4.7, tricking the server into thinking it is dealing with a Netscape version 4.X series browser running on a Palm Pilot:

$agent->agent('Mozilla/4.7 [en] (PalmOS)')
 

$address = $agent->from([$address])

The from() method gets or sets the e-mail address of the user responsible for the actions of the user agent. It is incorporated into the From: field used in mail and news postings, and will be issued, along with other fields, to HTTP servers. You do not need to provide this information when communicating with HTTP servers, but it can be provided in Web crawling robots as a courtesy to the remote site.

A number of methods control how the agent interacts with proxies, which are commonly used when the client is behind a firewall that doesn't allow direct Internet access, or in situations where bandwidth is limited and the organization wishes to cache frequently used URLs locally.

$proxy = $agent->proxy($protocol => $proxy)

The proxy() method sets or gets the proxy servers used for requests. The first argument, $protocol, is either a scalar containing the name of a protocol to proxy, such as "ftp", or an array reference that lists several protocols to proxy, such as ['ftp','http','gopher']. The second argument, $proxy, is the URL of the proxy server to use. For example:

$agent->proxy([qw(ftp http)] => 'http://proxy.cshl.org:8080')
 

You may call this method several times if you need to use a different proxy server for each protocol:

$agent->proxy(ftp => 'http://proxy1.cshl.org:8080');
 $agent->proxy(http => 'http://proxy2.cshl.org:9000');
 

As this example shows, HTTP servers are commonly used to proxy FTP requests as well as HTTP requests.

$agent->no_proxy(@domain_list)

Call the no_proxy() method to deactivate proxying for one or more domains. You would typically use this to turn off proxying for intranet servers that you can reach directly. This code fragment disables proxying for the "localhost" server and all machines in the "cshl.org" domain:

$agent->no_proxy('localhost','cshl.org')
 

Calling no_proxy() with an empty argument list clears the list of proxyless domains. It cannot be used to return the current list.

$agent->env_proxy

env_proxy() is an alternative way to set up proxies. Instead of taking proxy information from its argument list, this method reads proxy settings from *_proxy environment variables. These are the same environment variables used by UNIX and Windows versions of Netscape. For example, a C-shell initialization script might set the FTP and HTTP proxies this way:

setenv ftp_proxy http://proxy1.cshl.org:8080
 setenv http_proxy http://proxy2.cshl.org:9000
 setenv no_proxy localhost,cshl.org
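Putting the proxy methods together, a configuration might look like the sketch below. The proxy host proxy.example.com and the example.com domain are placeholders; substitute the values for your own site.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $agent = LWP::UserAgent->new;

# Route both HTTP and FTP requests through one (placeholder) proxy.
$agent->proxy([qw(http ftp)] => 'http://proxy.example.com:8080');

# Talk directly to local machines.
$agent->no_proxy('localhost', 'example.com');

# Called with just a protocol name, proxy() returns the current setting.
my $http_proxy = $agent->proxy('http');
print "HTTP proxy: $http_proxy\n";

# Alternatively, pick up http_proxy, ftp_proxy and no_proxy
# from the environment instead of hard-coding them:
# $agent->env_proxy;
```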
 

Lastly, the agent object offers several methods for controlling authentication and cookies.

($name,$pass) = $agent->get_basic_credentials($realm,$url [,$proxy])

When a remote HTTP server requires password authentication to access a URL, the user agent invokes its get_basic_credentials() method to return the appropriate username and password. The arguments consist of the authentication "realm name", the URL of the request, and an optional flag indicating that the authentication was requested by an intermediate proxy server rather than the destination Web server. The realm name is a string that the server sends to identify a group of documents that can be accessed using the same username/password pair.

By default, get_basic_credentials() returns the username and password stored among the user agent's instance variables by the credentials() method. However, it is often more convenient to subclass LWP::UserAgent and override get_basic_credentials() in order to prompt the user to enter the required information. We'll see an example of this later.

$agent->credentials($hostport,$realm,$name,$pass)

The credentials() method stores a username and password for use by get_basic_credentials(). The arguments are the server hostname and port in the format hostname:port, authentication realm, username, and password.
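The relationship between the two methods can be seen in this sketch, which stores a username and password and then looks them up the same way the user agent would when challenged. The host, realm, and login details are invented for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $agent = LWP::UserAgent->new;

# Store a (made-up) login for one realm on one server.
$agent->credentials('www.example.com:80', 'Members Only', 'fred', 'secret');

# LWP calls this itself on a 401 response; we call it by hand to see the result.
my $url = URI->new('http://www.example.com/private/index.html');
my ($name, $pass) = $agent->get_basic_credentials('Members Only', $url);

print "Would log in as $name\n";
```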

$jar = $agent->cookie_jar([$cookie_jar])

By default, LWP::UserAgent ignores cookies that are sent to it by remote Web servers. You can make the agent fully cookie-compatible by giving it an object of type HTTP::Cookies. The module will then stash incoming cookies into this object, and later search it for stored cookies to return to the remote server. Called with an HTTP::Cookies argument, cookie_jar() uses the indicated object to store its cookies. Called without arguments, cookie_jar() returns the current cookie jar.

We won't go through the complete HTTP::Cookies API, which allows you to examine and manipulate cookies, but here is the idiom to use if you wish to accept cookies for the current session, but not save them between sessions:

$agent->cookie_jar(HTTP::Cookies->new);
 

Here is the idiom to use if you wish to save cookies automatically in a file named .lwp-cookies for use across multiple sessions:

my $file = "$ENV{HOME}/.lwp-cookies";
 $agent->cookie_jar(HTTP::Cookies->new(file=>$file,autosave=>1));
 

Finally, here is how to tell LWP to use an existing Netscape-format cookies file, assuming that it is stored in your home directory in the file ~/.netscape/cookies (Windows and Mac users must modify this accordingly):

my $file = "$ENV{HOME}/.netscape/cookies";
 $agent->cookie_jar(HTTP::Cookies::Netscape->new(file=>$file,
 autosave=>1));
 

LWP Examples

Now that we've seen the LWP API, we'll look at some practical examples that use it.

Fetching a List of RFCs

The Internet FAQ Consortium (http://www.faqs.org/) maintains a Web server that archives a large number of useful Internet documents, including Usenet FAQs and IETF RFCs. Our first example is a small command-line tool to fetch a list of RFCs by their numbers.

The RFC archive at http://www.faqs.org/ follows a predictable pattern. To view RFC 1028, for example, we would fetch the URL http://www.faqs.org/rfcs/rfc1028.html. The returned HTML document is a minimally marked-up version of the original text-only RFC. The FAQ Consortium adds an image and a few links to the top and bottom. In addition, every reference to another RFC becomes a link.

Figure 9.4 shows the get_rfc.pl script. It accepts one or more RFC numbers on the command line, and prints their contents to standard output. For example, to fetch RFCs 1945 and 2616, which describe HTTP versions 1.0 and 1.1, respectively, invoke get_rfc.pl like this:

Figure 9.4. The get_rfc.pl script

graphics/09fig04.gif

% get_rfc.pl 1945 2616
 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
 <HTML>
 <HEAD>
 <TITLE>rfc1945 - Hypertext Transfer Protocol -- HTTP/1.0</TITLE>
 <LINK REV="made" HREF="mailto:rfc-admin@faqs.org">
 <META name="description" content="Hypertext Transfer Protocol -- HTTP/1.0">
 <META name="authors" content="T. Berners-Lee, R. Fielding & H. Frystyk">
 ...
 

The retrieved files can be saved to disk or viewed in a browser.

Lines 1-4: Load modules. We turn on strict syntax checking and load the LWP module. In addition, we define a constant URL prefix to use for fetching the desired RFC.

Line 5: Process command-line arguments. We check that at least one RFC number is given on the command line, or die with a usage message.

Lines 6-8: Create user agent. We create a new LWP::UserAgent and change its default User-Agent: field to get_rfc/1.0. We follow this with the original default agent ID enclosed in parentheses.

Lines 9-18: Main loop. For each RFC listed on the command line, we construct the appropriate URL and use it to create a new HTTP::Request GET request. We pass the request to the user agent object's request() method and examine the response. If the response's is_success() method indicates success, we print the retrieved content. Otherwise, we issue a warning using the response's status message.
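Because the figure itself is an image, here is a sketch of a script along the lines just described. It is a reconstruction from the walkthrough, not the original listing, so line numbers will not match; the rfc_url() and fetch_rfcs() helpers are names introduced here, and the sketch warns instead of dying when no arguments are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# URL prefix of the RFC archive, as given in the text.
use constant RFCS => 'http://www.faqs.org/rfcs/';

# Build the archive URL for one RFC number, e.g. 1945 -> .../rfc1945.html
sub rfc_url {
    my $number = shift;
    return RFCS . "rfc$number.html";
}

# Fetch each RFC and print it, warning about any that fail.
sub fetch_rfcs {
    my @numbers = @_;
    my $agent = LWP::UserAgent->new;
    $agent->agent('get_rfc/1.0 (' . $agent->agent . ')');

    for my $rfc (@numbers) {
        my $request  = HTTP::Request->new(GET => rfc_url($rfc));
        my $response = $agent->request($request);
        if ($response->is_success) {
            print $response->content;
        } else {
            warn "RFC $rfc: ", $response->message, "\n";
        }
    }
}

if (@ARGV) {
    fetch_rfcs(@ARGV);
} else {
    warn "Usage: get_rfc.pl rfc1 rfc2...\n";
}
```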

Mirroring a List of RFCs

The next example represents a slight modification. Instead of fetching the requested RFCs and sending them to standard output, we'll mirror local copies of them as files stored in the current working directory. LWP will perform the fetch conditionally so that the remote document will be fetched only if it is more recent than the local copy. In either case, the script reports the outcome of each attempt, as shown in this example:

% mirror_rfc.pl 2616 1945 11
 RFC 2616: OK
 RFC 1945: Not Modified
 RFC 11: Not Found
 

We ask the script to retrieve RFCs 2616, 1945 and 11. The status reports indicate that RFC 2616 was retrieved OK, RFC 1945 did not need to be retrieved because the local copy is current, and that RFC 11 could not be retrieved because no such file exists on the remote server (there is, in fact, no RFC 11).

The code, shown in Figure 9.5, is only 15 lines long.

Figure 9.5. The mirror_rfc.pl script

graphics/09fig05.gif

Lines 1-8: Load modules and create user agent. The setup of the LWP::UserAgent is identical to the previous example, except that we modify the usage message and the user agent ID appropriately.

Lines 9-15: Main loop. We read RFC numbers from the command line. For each RFC, we construct a local filename of the form rfcXXXX.html, where XXXX is the number of the requested document. We append this to the RFC server's base URL in order to obtain the full remote URL.

In contrast with the previous example, we don't need to create an HTTP::Request in order to do mirroring. We simply pass the remote URL and local filename to the agent's mirror() method, obtaining an HTTP::Response in return. We then print the status message returned by the response object's message() method.
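A sketch of the mirroring loop, reconstructed from the description above rather than copied from Figure 9.5, might look like this. The rfc_filename() and mirror_rfcs() helpers are names introduced for the example, and the sketch warns instead of dying when no arguments are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# URL prefix of the RFC archive, as given in the text.
use constant RFCS => 'http://www.faqs.org/rfcs/';

# Local filename for an RFC; the same name is used on the server.
sub rfc_filename {
    my $number = shift;
    return "rfc$number.html";
}

# Mirror each RFC into the current directory and report the outcome.
sub mirror_rfcs {
    my @numbers = @_;
    my $agent = LWP::UserAgent->new;
    $agent->agent('mirror_rfc/1.0 (' . $agent->agent . ')');

    for my $rfc (@numbers) {
        my $filename = rfc_filename($rfc);
        # mirror() only transfers the file if the local copy is out of date.
        my $response = $agent->mirror(RFCS . $filename, $filename);
        print "RFC $rfc: ", $response->message, "\n";
    }
}

if (@ARGV) {
    mirror_rfcs(@ARGV);
} else {
    warn "Usage: mirror_rfc.pl rfc1 rfc2...\n";
}
```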

Simulating Fill-out Forms

The previous two examples fetched static documents from remote Web servers. However, much of the interesting content on the Web is generated by dynamic server-side scripts such as search pages, on-line catalogs, and news updates.

Server-side CGI scripts (as well as servlets and other types of dynamic content) are usually driven by fill-out HTML forms. Forms consist of a series of fields to complete: typically a mixture of text fields, pop-up menus, scrolling lists, and buttons. Each field has a name and a value. When the form is submitted, usually by clicking on a button, the names and current values of the form are bundled into a special format and sent to the server script.

You can simulate the submission of a fill-out form from within LWP provided that you know what arguments the remote server is expecting and how it is expecting to receive them. Sometimes the remote Web site documents how to call its server-side scripts, but more often you have to reverse engineer the script by looking at the fill-out form's source code.

For example, the Internet FAQ Consortium provides a search page at http://www.faqs.org/rfcs/ that includes, among other things, a form for searching the RFC archive with text search terms. By navigating to the page in a conventional browser and selecting the "View Source" command, I obtained the HTML source code for the page. Figure 9.6 shows an excerpt from this page, which contains the definition for the search form (it's been edited slightly to remove extraneous formatting tags).

Figure 9.6. Definition of the HTML form used by the FAQ Consortium's RFC search script

graphics/09fig06.gif

In HTML, fill-out forms start with a <FORM> tag and end with </FORM>. Between the two tags are one or more <INPUT> tags, which create simple fields like text entry fields and buttons, <SELECT> tags, which define multiple-choice fields like scrolling lists and pop-up menus, and <TEXTAREA> tags, which create large text entry fields with horizontal and vertical scrollbars.

Form elements have a NAME attribute, which assigns a name to the field when it is sent to the Web server, and optionally a VALUE attribute, which assigns a default value to the field. <INPUT> tags may also have a TYPE attribute that alters the appearance of the field. For example, TYPE="text" creates a text field that the user can type in, TYPE="checkbox" creates an on/off checkbox, and TYPE="hidden" creates an element that isn't visible in the rendered HTML, but nevertheless has its name and value passed back to the server when the form is submitted.

The <FORM> tag itself has two required attributes. METHOD specifies how the contents of the fill-out form are to be sent to the Web server, and may be one of GET and POST. We'll talk about the implications of the method later. ACTION specifies the URL to which the form fields are to be sent. It may be a full URL or an abbreviated form relative to the URL of the HTML page that contains the form.

Occasionally, the ACTION attribute may be missing entirely, in which case the form fields should be submitted to the URL of the page in which the form is located. Strictly speaking, this is not valid HTML, but it is widely used.

In the example in Figure 9.6, the RFC search form consists of two elements. A text field named "query" prompts the user for the text terms to search for, and a menu named "archive" specifies which part of the archive to search in. The various menu choices are specified using a series of <OPTION> tags, and include the values "rfcs", "rank", and "rfcindex". There is also a submission button, created using an <INPUT> tag with a TYPE attribute of "submit". However, because it has no NAME attribute, its contents are not included in the information to the server. Figure 9.7 shows what this looks like when rendered by a browser.

Figure 9.7. The FAQ Consortium's fill-out form rendered by a browser

graphics/09fig07.gif

When the form is submitted, the browser bundles the current contents of the form into a "query string" using a MIME format known as application/x-www-form-urlencoded. This format consists of a series of name=value pairs, where the names and values are taken from the form elements and their current values. Each pair is separated by an ampersand (&) or semicolon (;). For example, if we typed "MIME types" into the RFC search form's text field and selected "Search RFC Index" from the pop-up menu, the query string generated by the browser would be:

query=MIME%20types&archive=rfcindex
 

Notice that the space in "MIME types" has been turned into the string %20. This is a hexadecimal escape for the space character (0x20 in ASCII). A number of characters are illegal in query strings, and must be escaped in this way. As we shall see, the URI::Escape module makes it easy to create escaped query strings.
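The escaping step can be sketched with URI::Escape as follows. The field names and values are the ones from the RFC search form; the %fields hash and the fixed field order are choices made for this example.

```perl
use strict;
use warnings;
use URI::Escape;

# Form fields and the values a user might have entered.
my %fields = (query => 'MIME types', archive => 'rfcindex');

# Escape each name and value, then join the pairs with ampersands.
my $query_string = join '&',
    map { uri_escape($_) . '=' . uri_escape($fields{$_}) } qw(query archive);

print "$query_string\n";

# uri_unescape() reverses the process.
my $original = uri_unescape('MIME%20types');
print "$original\n";
```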

The way the browser sends the query string to the Web server depends on whether the form submission method is GET or POST. In the case of GET, a "?" followed by the query string is appended directly to the end of the URL indicated by the <FORM> tag's ACTION attribute. For example:

http://www.faqs.org/cgi-bin/rfcsearch?query=MIME%20types&archive=rfcindex
 

In the case of a form that specifies the POST method, the correct action is to POST a request to the URL indicated by ACTION, and pass the query string as the request content.

It is very important to send the query string to the remote server in the way specified by the <FORM> tag. Some server-side scripts are sufficiently flexible to recognize and deal with both GET and POST requests in a uniform way, but many do not.
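The difference between the two submission methods can be sketched with HTTP::Request. Nothing is sent here; the example only builds the two request objects so they can be compared. The $action and $query_string variables are names chosen for the sketch.

```perl
use strict;
use warnings;
use HTTP::Request;

my $action       = 'http://www.faqs.org/cgi-bin/rfcsearch';
my $query_string = 'query=MIME%20types&archive=rfcindex';

# GET: the query string rides along in the URL after a "?".
my $get = HTTP::Request->new(GET => "$action?$query_string");

# POST: the URL is the bare ACTION, and the query string is the content.
my $post = HTTP::Request->new(POST => $action);
$post->content_type('application/x-www-form-urlencoded');
$post->content($query_string);
$post->content_length(length $query_string);

print $get->as_string, "\n", $post->as_string, "\n";
```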

In addition to query strings of type application/x-www-form-urlencoded, some fill-out forms use a newer encoding system called multipart/form-data. We will talk about dealing with such forms in the section File Uploads Using multipart/form-data.

Our next sample script is named search_rfc.pl. It invokes the server-side script located at http://www.faqs.org/cgi-bin/rfcsearch to search the RFC index for documents having some relevance to the search terms given on the command line. Here's how to search for the term "MIME types":

% search_rfc.pl MIME types
 RFC 2503     MIME Types for Use with the ISO ILL Protocol
 RFC 1927     Suggested Additional MIME Types for Associating Documents
 

search_rfc.pl works by simulating a user submission of the fill-out form shown in Figures 9.6 and 9.7. We generate a query string containing the query and archive fields, and POST it to the server-side search script. We then extract the desired information from the returned HTML document and print it out.

To properly escape the query string, we use the uri_escape() function, provided by the LWP module named URI::Escape. uri_escape() replaces disallowed characters in URLs with their hexadecimal escapes. Its companion, uri_unescape(), reverses the process.

Figure 9.8 shows the code for the script.

Figure 9.8. The search_rfc.pl script

graphics/09fig08.gif

Lines 1-4: Load modules We turn on strict syntax checking and load the LWP and URI::Escape modules. URI::Escape imports the uri_escape() and uri_unescape() functions automatically.

Lines 5-7: Define constants We define one constant for the URL of the remote search script, and another for the page on which the fill-out form is located. The latter is needed to properly fill out the Referer: field of the request, for reasons that we will explain momentarily.

Lines 8-10: Create user agent This code is identical to the previous examples, except for the user agent ID.

Lines 11-12: Construct query string We interpolate the command-line arguments into a string and use it as the value of the fill-out form's query field. We are interested in searching the archive's RFC index, so we use "rfcindex" as the value of the archive field. These are incorporated into a properly formatted query string and escaped using uri_escape().

Lines 13-15: Construct request We create a new POST request on the remote search script, and use the returned request object's content() method to set the content to the query string. We also alter the request object's Referer: header so that it contains the fill-out form's URL. This is a precaution. For consistency, some server-side scripts check the Referer: field to confirm that the request came from a fill-out form located on their own server, and refuse to service requests that do not contain the proper value. Although the Internet FAQ Consortium's search script does not seem to implement such checks, we set the Referer: field here in case they decide to do so in the future.

As an aside, the ease with which we are able to defeat the Referer: check illustrates why this type of check should never be relied on to protect server-side Web scripts from misuse.
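The request-building step can be sketched on its own as follows. This is not the listing from Figure 9.8, only an illustration of the calls involved; the query string is written out as a literal so that the example does not depend on how it was produced, and the request is printed rather than sent.

```perl
use strict;
use warnings;
use HTTP::Request;

use constant SEARCH_URL => 'http://www.faqs.org/cgi-bin/rfcsearch';
use constant FORM_PAGE  => 'http://www.faqs.org/rfcs';

# A query string that has already been escaped
my $query_string = 'query=MIME%20types&archive=rfcindex';

my $request = HTTP::Request->new(POST => SEARCH_URL);
$request->content_type('application/x-www-form-urlencoded');
$request->content($query_string);
$request->header(Referer => FORM_PAGE);   # claim to come from the form page

print $request->as_string;
```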

Lines 16-17: Submit request We pass the request to the LWP::UserAgent's request() method, obtaining a response object. We check the response status with is_success(), and die if the method indicates a failure of some sort.

Lines 18-21: Fetch and parse content We retrieve the returned HTML document by calling the response object's content() method and assign it to a scalar variable. We now need to extract the RFC name and title from the document's HTML. This is easy to do because the document has the predictable structure shown in Figures 9.9 (screenshot) and 9.10 (HTML source). Each matching RFC is an item in an ordered list (HTML tag <OL>) in which the RFC number is contained within an <A> tag that links to the text of the RFC, and the RFC title is contained between a pair of <STRONG> tags.

Figure 9.9. RFC Index Search results

graphics/09fig09.gif

Figure 9.10. HTML code for the RFC Index Search results

graphics/09fig10.gif

We use a simple global regular expression match to find and match all lines referring to RFCs, extract the RFC name and title, and print the information to standard output.
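To make the pattern-matching step concrete, here is a sketch that runs a global match over a small HTML fragment. The fragment is hypothetical, written to imitate the structure described above (the real page may differ in detail), and the regular expression is one plausible choice rather than the one used in the script.

```perl
use strict;
use warnings;

# Hypothetical fragment imitating the structure described in the text
my $html = <<'END';
<OL>
<LI><A HREF="/rfcs/rfc2503.html">RFC 2503</A> - <STRONG>MIME Types for Use with the ISO ILL Protocol</STRONG>
<LI><A HREF="/rfcs/rfc1927.html">RFC 1927</A> - <STRONG>Suggested Additional MIME Types for Associating Documents</STRONG>
</OL>
END

my @matches;
# /g repeats the match, /i ignores tag case, /s lets "." cross line breaks
while ($html =~ m{<A\s[^>]*>\s*(RFC\s+\d+)\s*</A>.*?<STRONG>(.*?)</STRONG>}gis) {
    push @matches, [$1, $2];
    printf "%-12s %s\n", $1, $2;
}
```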

An enhancement to this script would be to provide an option to fetch the text of each RFC returned by the search. One way to do this would be to insert a call to $ua->request() for each matched RFC. Another, and more elegant, way would be to modify get_rfc.pl from Figure 9.4 so as to accept its list of RFC numbers from standard input. This would allow you to fetch the content of each RFC returned by a search by combining the two commands in a pipeline:

% search_rfc.pl MIME types | get_rfc.pl
 						

Because the Internet FAQ Consortium has not published the interface to its search script, there is no guarantee that they will not change either the form of the query string or the format of the HTML document returned in response to searches. If either of these things happens, search_rfc.pl will break. This is a chronic problem for all such Web client scripts and a compelling reason to check at each step of a complex script that the remote Web server is returning the results you expect.

This script contains a subtle bug in the way it constructs its query strings. Can you find it? The bug is revealed in the next section.

Using HTTP::Request::Common to Post a Fill-out Form

Because submitting the field values from fill-out forms is so common, LWP provides a class named HTTP::Request::Common to make this convenient to do. When you load HTTP::Request::Common, it imports four functions named GET(), POST(), HEAD(), and PUT(), which build various types of HTTP::Request objects.

We will look at the POST() function, which builds HTTP::Request objects suitable for simulating fill-out form submissions. The other three are similar.

$request = POST($url [,$form_ref] [,$header1=>$val1....])

The POST() function returns an HTTP::Request object that uses the POST method. $url is the requested URL, and may be a simple string or a URI object. The optional $form_ref argument is an array reference containing the names and values of form fields to submit as content. If you wish to add additional headers to the request, you can follow this with a list of header/value pairs.

Using POST() here's how we could construct a request to the Internet FAQ Consortium's RFC index search engine:

my $request = POST('http://www.faqs.org/cgi-bin/rfcsearch',
                    [ query   => 'MIME types',
                      archive => 'rfcindex' ]
                    );
 

And here's how to do the same thing but setting the Referer: header at the same time:

my $request =POST('http://www.faqs.org/cgi-bin/rfcsearch',
                    [ query   => 'MIME types',
                      archive => 'rfcindex' ],
                    Referer => 'http://www.faqs.org/rfcs');
 

Notice that the field/value pairs of the request content are contained in an array reference, but the name/value pairs of the request headers are a simple list.

As an alternative, you may provide the form data as the argument to a pseudoheader field named Content:. This looks a bit cleaner when setting both request headers and form content:

my $request = POST('http://www.faqs.org/cgi-bin/rfcsearch',
                    Content => [ query   => 'MIME types',
                                 archive => 'rfcindex' ],
                    Referer => 'http://www.faqs.org/rfcs');
 

POST() will take care of URI escaping the form fields and constructing the appropriate query string.
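To see what POST() produces, you can build a request and inspect it without sending it. The sketch below simply prints the pieces of the request object:

```perl
use strict;
use warnings;
use HTTP::Request::Common;   # exports POST() among others

my $request = POST('http://www.faqs.org/cgi-bin/rfcsearch',
                   Content => [ query   => 'MIME types',
                                archive => 'rfcindex' ],
                   Referer => 'http://www.faqs.org/rfcs');

print $request->method, "\n";                   # the HTTP method
print $request->header('Content-Type'), "\n";   # set for us by POST()
print $request->header('Referer'), "\n";
print $request->content, "\n";                  # the encoded form fields
```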

Using HTTP::Request::Common, we can rewrite search_rfc.pl as shown in Figure 9.11. The new version is identical to the old except that it uses POST() to construct the fill-out form submission and to set the Referer: field of the outgoing request (lines 12-17).

Figure 9.11. An improved version of search_rfc.pl

graphics/09fig11.gif

Compared to the original version of the search_rfc.pl script, the new script is easier to read. More significant, however, it is less prone to bugs. The query-string generator in the earlier version contains a bug that causes it to generate broken query strings when given a search term that contains either of the characters "&" or "=". For example, given the search term "mime&types", the original version generates the string:

query=mime&types&archive=rfcindex
 

The manual fix would be to replace " & " with " %26 " and " = " with " %3D " in the search terms before constructing the query string and passing it to uri_escape(). However, the POST()-based version handles this automatically, and generates the correct content:

query=mime%26types&archive=rfcindex
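If you do need to build the string by hand, one way to avoid the problem is to escape each field name and value separately and only then join them with "=" and "&". The sketch below does this; the explicit second argument to uri_escape() is an assumption made for portability, so that "&" and "=" are escaped regardless of the default character set of the installed URI::Escape version.

```perl
use strict;
use warnings;
use URI::Escape;

my %fields = (query => 'mime&types', archive => 'rfcindex');

# Escape everything except letters, digits, "-", ".", "_" and "~"
my $unsafe = '^A-Za-z0-9\-._~';

# Escape names and values one at a time, then join them with the
# literal "=" and "&" separators.
my $query_string = join '&',
    map { uri_escape($_, $unsafe) . '=' . uri_escape($fields{$_}, $unsafe) }
    qw(query archive);

print "$query_string\n";   # query=mime%26types&archive=rfcindex
```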
 

File Uploads Using multipart/form-data

In addition to form elements that allow users to type in text data, HTML version 4 and higher provides an <INPUT> element of type "file". When compatible browsers render this tag, they generate a user interface element that prompts the user for a file to upload. When the form is submitted, the browser opens the file and sends its contents, allowing whole files to be uploaded to a server-side Web script.

However, this feature is not very compatible with the application/x-www-form-urlencoded encoding of query strings because of the size and complexity of most uploaded files. Server scripts that support this feature use a different type of query encoding scheme called multipart/form-data. Forms that support this encoding are enclosed in a <FORM> tag with an ENCTYPE attribute that specifies this scheme. For instance:

<FORM METHOD=POST ACTION="/cgi-bin/upload" ENCTYPE="multipart/form-data">
 

The POST method is always used with this type of encoding. multipart/form-data uses an encoding scheme that is extremely similar to the one used for multipart MIME enclosures. Each form element is given its own subpart with a Content-Disposition: of "form-data", a name containing the field name, and body data containing the value of the field. For uploaded files, the body data is the content of the file.

Although conceptually simple, it's tricky to generate the multipart/form-data format correctly. Fortunately, the POST() function provided by HTTP::Request::Common can also generate requests compatible with multipart/form-data. The key is to provide POST() with a Content_Type: header argument of "form-data":

my $request = POST('http://www.faqs.org/cgi-bin/rfcsearch',
                     Content_Type => 'form-data',
                     Referer      => 'http://www.faqs.org/rfcs',
                     Content      => [ query   => 'MIME types',
                                       archive => 'rfcindex' ]
                    );
 

This generates a request to the RFC search engine using the multipart/form-data encoding scheme. But don't try it: the RFC FAQ site doesn't know how to handle this scheme.
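Although the request should not be sent, it is safe to build it and print it, which is a convenient way to see what the encoding looks like. The following sketch does just that:

```perl
use strict;
use warnings;
use HTTP::Request::Common;

my $request = POST('http://www.faqs.org/cgi-bin/rfcsearch',
                   Content_Type => 'form-data',
                   Content      => [ query   => 'MIME types',
                                     archive => 'rfcindex' ]);

# Built locally for inspection only; nothing goes out on the network.
# Each field appears as its own part, separated by a boundary string.
print $request->as_string;
```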

To tell LWP to upload a file, the value of the corresponding form field must be an array reference containing at least one element:

$fieldname => [ $file, $filename, header1=>$value.... ]
 

The mandatory first element in the array, $file, is the path to the file to upload. The optional $filename argument is the suggested name to use for the file, and is similar to the MIME::Entity Filename argument. This is followed by any number of additional MIME headers. The one used most frequently is Content_Type:, which gives the server script the MIME type of the uploaded file.

To illustrate how this works, we'll write a client for the CGI script located at http://stein.cshl.org/WWW/software/CGI/examples/file_upload.cgi. This is a script that I wrote some years ago to illustrate how CGI scripts accept and process uploaded files. The form that drives the script (Figures 9.12 and 9.13) contains a single file field named filename, and three checkboxes named count with values named "count lines", "count words", and "count characters". There's also a hidden field named .cgifields with a value of "count."

Figure 9.12. The form that drives the file_upload.cgi script

graphics/09fig12.gif

Figure 9.14. Output from file_upload.cgi script

graphics/09fig14.gif

After form submission, the script reads the uploaded file and counts its lines, words, and/or characters, depending on which checkboxes are selected. It prints these statistics, along with the name of the file and its MIME type, if any (Figure 9.14).

Figure 9.13. HTML source for the file_upload.cgi form

graphics/09fig13.gif

We will now develop an LWP script to drive this CGI script. remote_wc.pl reads a file from the command line or standard input and uploads it to file_upload.cgi. It parses the HTML result and prints the word count returned by the remote server:

% remote_wc.pl ~/public_html/png.html
 lines = 20; words = 47; characters = 362
 

This is a pretty difficult way to perform a word count, but it does illustrate the technique! Figure 9.15 gives the code for remote_wc.pl.

Figure 9.15. The remote_wc.pl script

graphics/09fig15.gif

Lines 1-4: Load modules We turn on strict syntax checking and load the LWP and HTTP::Request::Common modules.

Lines 5-7: Process arguments We define a constant for the URL of the CGI script and recover the name of the file to upload from the command line.

Lines 8-21: Create user agent and request We create the LWP::UserAgent in the usual way. We then create the request using the POST() function, passing the URL of the CGI script as the first argument, a Content_Type argument of "form-data", and a Content argument containing the various fields used by the upload form.

Notice that the count field appears three times in the Content array, once for each of the checkboxes in the form. The value of the filename field is an anonymous array containing the file path provided on the command line. We also provide values for the .cgifields hidden field and the submit button, even though it isn't clear that they are necessary (they aren't, but unless you have the documentation for the remote server script, you won't know this).
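The shape of such a request can be sketched as follows. This is an illustration rather than the listing in Figure 9.15: it uploads a small temporary file created on the spot, the suggested name "sample.txt" is made up, and the submit button field is left out because its name is not given in the text. The request is built and inspected but not sent.

```perl
use strict;
use warnings;
use File::Temp ();
use HTTP::Request::Common;

# Create a small throwaway file so the example is self-contained
my $tmp = File::Temp->new(SUFFIX => '.txt');
print {$tmp} "one two three\nfour five\n";
close $tmp;

my $request = POST(
    'http://stein.cshl.org/WWW/software/CGI/examples/file_upload.cgi',
    Content_Type => 'form-data',
    Content      => [
        # file field: [ path, suggested name, extra MIME headers ]
        filename     => [ $tmp->filename, 'sample.txt',
                          Content_Type => 'text/plain' ],
        # one entry per checkbox, all under the same field name
        count        => 'count lines',
        count        => 'count words',
        count        => 'count characters',
        '.cgifields' => 'count',
    ],
);

my $count_parts = () = $request->content =~ /name="count"/g;
print "count fields in request body: $count_parts\n";
```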

Lines 22-23: Issue request We call the user agent's request() method to issue the POST, and get a response object in return. As in earlier scripts, we check the is_success() method and die if an error occurs.

Lines 24-27: Extract results We call the response's content() method to retrieve the HTML document generated by the remote script, and perform a pattern match on it to extract the values for the line, word, and character counts (this regular expression was generated after some experimentation with sample HTML output). Before exiting, we print the extracted values to standard output.

Fetching a Password-Protected Page

Some Web pages are protected by username and password using HTTP authentication. LWP can handle the authentication protocol, but needs to know the username and password.

There are two ways to provide LWP with this information. One way is to store the username and password in the user agent's instance variables using its credentials() method. As described earlier, credentials() stores the authentication information in a hash table indexed by the Web server's hostname, port, and realm. If you store a set of passwords before making the first request, LWP::UserAgent consults this table to find a username and password to use when accessing a protected page. This is the default behavior of the get_basic_credentials() method.
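A minimal sketch of the first approach follows. The host, realm, username, and password are the ones used in the example session later in this section; the port is written explicitly as part of the "host:port" string. No request is made; we only store the credentials and ask the default get_basic_credentials() to look them up again.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $ua = LWP::UserAgent->new;

# Arguments: "host:port", realm, username, password
$ua->credentials('stein.cshl.org:80', 'example', 'perl', 'programmer');

# The default get_basic_credentials() consults the stored table,
# using the realm and the host and port of the requested URL.
my $url = URI->new('http://stein.cshl.org/private/');
my ($user, $pass) = $ua->get_basic_credentials('example', $url, 0);

print "$user / $pass\n";
```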

The other way is to ask the user for help at runtime. You do this by subclassing LWP::UserAgent and overriding the get_basic_credentials() method. When invoked, the customized get_basic_credentials() prompts the user to enter the required information.

The get_url2.pl script implements this latter scheme. For unprotected pages, it acts just like the original get_url.pl script (Figure 9.1). However, when fetching a protected page, it prompts the user to enter his or her username and password. If the name and password are accepted, the URL is copied to standard output. Otherwise, the request fails with an "Authorization Required" error (status code 401):

% get_url2.pl http://stein.cshl.org/private/
 Enter username and password for realm "example".
 username: perl
 password: programmer
 <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
 <html> <head>
 <title>Password Protected Page</title>
 <link rel="stylesheet" href="/stylesheets/default.css">
 </head>
 ...
 

If you wish to try this script with the URL given in the example, the username is "perl" and the password is "programmer."

Figure 9.16 shows the code for get_url2.pl. Except for an odd little idiom, it's straightforward. We are going to declare a subclass of LWP::UserAgent, but we don't want to create a whole module file just to override a single method. Instead, we arrange for the script itself (package "main") to be a subclass of LWP::UserAgent, and override the get_basic_credentials() method directly in the main script file. This is a common, and handy, trick.

Figure 9.16. The get_url2.pl script

graphics/09fig16.gif

Lines 1-6: Load modules We turn on strict syntax checking and load LWP. We also load the PromptUtil module (listed in Appendix A), which provides us with the get_passwd() function for prompting the user for a password without echoing it to the screen.

We set the @ISA array to make sure that the current package is a subclass of LWP::UserAgent.

Lines 7-12: Issue request, print content The main section of the script is identical to the original get_url.pl, with one exception. Instead of calling LWP::UserAgent->new() to create a new user agent object, we call __PACKAGE__->new(). The Perl interpreter automatically replaces the __PACKAGE__ token with the name of the current package ("main" in this case), creating the desired LWP::UserAgent subclass.

Lines 13-20: Override get_basic_credentials() method This section of the code overrides get_basic_credentials() with a custom subroutine. The subclass behaves exactly like LWP::UserAgent until it needs to fetch authentication information, at which point this subroutine is invoked.

We are called with three arguments, consisting of the user agent object, the authentication realm, and the URL that has been requested. We prompt the user for a username, and then call get_passwd() to prompt and fetch the user's password. These are returned to the caller as a two-element list.
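The idiom can be sketched in a few lines. This is not the listing from Figure 9.16: in particular it reads the password with a plain prompt, so the input is echoed, whereas the real script uses get_passwd() from PromptUtil to avoid that. The override is only invoked when a server actually asks for authentication, so this fragment defines it and creates the agent without making a request.

```perl
use strict;
use warnings;
use LWP::UserAgent;

our @ISA = ('LWP::UserAgent');    # package "main" becomes a subclass

my $agent = __PACKAGE__->new;     # same as calling main->new

# Called by LWP when a server demands authentication
sub get_basic_credentials {
    my ($self, $realm, $url) = @_;
    print STDERR qq(Enter username and password for realm "$realm".\n);
    print STDERR "username: ";
    my $name = <STDIN>;
    return unless defined $name;
    chomp $name;
    return unless length $name;
    print STDERR "password: ";    # plain prompt: input will be echoed
    my $passwd = <STDIN>;
    $passwd = '' unless defined $passwd;
    chomp $passwd;
    return ($name, $passwd);
}

print ref($agent), "\n";          # main
```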

An interesting characteristic of this script is that if the username and password aren't entered correctly the first time, LWP invokes the get_basic_credentials() once more and the user is prompted to try again. If the credentials still aren't accepted, the request fails with an "Authorization Required" status. This nice "second try" feature appears to be built into LWP.

Parsing HTML and XML

Much of the information on the Web is now stored in the form of HTML documents. So far we have dealt with HTML documents in an ad hoc manner by writing regular expressions to parse out the particular information we want from a Web page. However, LWP offers a more general solution to this. The HTML::Parser class provides flexible parsing of HTML documents, and HTML::Formatter can format HTML as text or PostScript.

An added benefit of HTML::Parser is that at the flip of a switch it can handle XML (eXtensible Markup Language) as well. Because HTML was designed to display human-readable documents, it doesn't lend itself easily to automated machine processing. XML provides structured, easily parsed documents that are more software-friendly than traditional HTML. Over the next few years, HTML will gradually be replaced by XHTML, a version of HTML that follows XML's more exacting standards. HTML::Parser can handle HTML, XML, and XHTML, and in fact can be used to parse much of the more general SGML (Standard Generalized Markup Language) from which both HTML and XML derive. The XML standard and a variety of tutorials can be found at [http://www.w3.org/XML/].

In this section, we demonstrate how to use HTML::Formatter to transform HTML into nicely formatted plain text or PostScript. Then we show some examples of using HTML::Parser for the more general task of extracting information from HTML files.

Formatting HTML

The HTML::Formatter module is the base class for a family of HTML formatters. Only two members of the family are currently implemented. HTML::FormatText takes an HTML document and produces nicely formatted plain text, and HTML::FormatPS creates PostScript output. Neither subclass of HTML::Formatter handles inline images, forms, or tables. In some cases, this can be a big limitation.

There are two steps to formatting an HTML file. The first step is to parse the HTML into a parse tree, using a specialized subclass of HTML::Parser named HTML::TreeBuilder. The second step is to pass this parse tree to the desired subclass of HTML::Formatter to output the formatted text.

Figure 9.17 shows a script named format_html.pl that uses these modules to read an HTML file from the command line or standard input and format it. If given the postscript option, the script produces PostScript output suitable for printing. Otherwise, it produces plain text.

Figure 9.17. The format_html.pl script

graphics/09fig17.gif

Lines 1-4: Load modules We turn on strict syntax checking and load the Getopt::Long and HTML::TreeBuilder modules. The former processes the command-line arguments, if any. We don't load any HTML::Formatter modules at this time because we don't know yet whether to produce plain text or PostScript.

Lines 5-7: Process command-line options We call the GetOptions() function to parse the command-line options. This sets the global variable $PS to true if the postscript option is specified.

Lines 8-15: Create appropriate formatter If the user requested PostScript output, we load the HTML::FormatPS module and invoke the class's new() method to create a new formatter object. Otherwise, we do the same thing with the HTML::FormatText class. When creating an HTML::FormatPS formatter, we pass the new() method a PaperSize argument of "Letter" in order to create output compatible with the common 8 1/2 x 11" letter stock used in the United States.

Lines 16-18: Parse HTML We create a new HTML::TreeBuilder parser by calling the class's new() method. We then read the input HTML one line at a time using the <> operator and pass it to the parser object. When we are done, we tell the parser so by calling its eof() method.

This series of operations leaves the HTML parse tree in the parser object itself, in a variable named $tree.

Lines 19-20: Format and output the tree We pass the parse tree to the formatter's format() method, yielding a formatted string. We print this, and then clean up the parse tree by calling its delete() method.

The HTML::Formatter API

The API for HTML::Formatter and its subclasses is extremely simple. You create a new formatter with new() and perform the formatting with format(). A handful of arguments recognized by new() adjust the formatting style.

$formatter = HTML::FormatText->new([leftmargin=>$left,rightmargin=>$right])

HTML::FormatText->new() takes two optional arguments, leftmargin and rightmargin, which set the left and right page margins, respectively. The margins are measured in characters. If not specified, the left and right margins default to 3 and 72, respectively. It returns a formatter object ready for use in converting HTML to text.

$formatter = HTML::FormatPS->new([option1=>$val1, option2=>$val2...])

Similarly, HTML::FormatPS->new() creates a new formatter object suitable for rendering HTML into PostScript. It accepts a larger list of argument/value pairs, the most common of which are listed here:

  • PaperSize sets the page height and width appropriately for printing. Acceptable values are A3, A4, A5, B4, B5, Letter, Legal, Executive, Tabloid, Statement, Folio, 10x14, and Quarto. United States users take note! The default PaperSize is the European A4. You should change this to Letter if you wish to print on common 8 1/2 x 11" paper.

  • LeftMargin, RightMargin, TopMargin, and BottomMargin control the page margins. All are given in point units.

  • FontFamily sets the font family to use in the output. Recognized values are Courier, Helvetica, and Times, the default.

  • FontScale allows you to increase or decrease the font size by some factor. For example, a value of 1.5 will scale the font size up by 50 percent.

Once a formatter is created, you can use it as many times as you like to format HTML::TreeBuilder objects.

$text = $formatter->format($tree)

Pass an HTML parse tree to the format() method. The returned value is a scalar variable, which you can then print, save to disk, or send to a print spooler.
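Putting the two steps together, a minimal sketch of the parse-then-format sequence might look like this. The HTML is supplied as a string to keep the example self-contained, and the margin values are arbitrary.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

my $html = '<html><body><h1>Notice</h1>'
         . '<p>Oh dear, now the <strong>bird</strong> is gone!</p>'
         . '</body></html>';

# Step 1: build the parse tree
my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# Step 2: hand the tree to a formatter
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 60);
my $text = $formatter->format($tree);
print $text;

$tree->delete;    # release the tree when finished
```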

The HTML::TreeBuilder API

The basic API for HTML::TreeBuilder is also straightforward. You create a new HTML::TreeBuilder object by calling the class's new() method, then parse a document using parse() or parse_file(), and when you're done, destroy the object using delete().

$tree = HTML::TreeBuilder->new

The new() method takes no arguments. It returns a new, empty HTML::TreeBuilder object.

$result = $tree->parse_file($file)

The parse_file() method accepts a filename or filehandle and parses its contents, storing the parse tree directly in the HTML::TreeBuilder object. If the parse was successful, the result is a copy of the tree object; if something went wrong (check $! for the error message), the result is undef.

For example, we can parse an HTML file directly like this:

$tree->parse_file('rfc2010.html') or die "Couldn't parse: $!";
 

and parse from a filehandle like this:

open (F,'rfc2010.html') or die "Couldn't open: $!";
 $tree->parse_file(\*F);
 

$result = $tree->parse($data)

With the parse() method, you can parse an HTML file in chunks of arbitrary size. $data is a scalar that contains the HTML text to process. Typically you will call parse() multiple times, each time with the next section of the document to process. We will see later how to take advantage of this feature to begin HTML parsing while the file is downloading. If parse() is successful, it returns a copy of the HTML::TreeBuilder object; if something goes wrong during parsing, it returns undef.

$tree->eof

Call this method when using parse(). It tells HTML::TreeBuilder that no more data is coming and allows it to finish the parse.

Figure 9.17 is a good example of using parse() and eof() to parse the HTML file on standard input one line at a time.
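The same pattern works for chunks of any size. In this sketch the document arrives as three arbitrary pieces, one of which ends in the middle of a sentence; look_down() and as_text() are HTML::Element methods used here only to show that the tree was assembled correctly.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my @chunks = ('<html><head><title>Chunked parse</title>',
              '</head><body><p>Hello, ',
              'world</p></body></html>');

my $tree = HTML::TreeBuilder->new;
$tree->parse($_) for @chunks;   # feed the document piece by piece
$tree->eof;                     # no more data: finish the parse

my $para      = $tree->look_down(_tag => 'p');
my $para_text = $para ? $para->as_text : '';
print "$para_text\n";

$tree->delete;
```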

$tree->delete

When you are finished with an HTML::TreeBuilder tree, call its delete() method to clean up. Unlike other Perl objects, which are automatically destroyed when they go out of scope, you must be careful to call delete() explicitly when working with HTML::TreeBuilder objects or risk memory leaks. The HTML::Element POD documentation explains why this is so.

Many scripts combine HTML::TreeBuilder object creation with file parsing using this idiom:

$tree = HTML::TreeBuilder->new->parse_file('rfc2010.html');
 

However, the HTML::TreeBuilder object created this way will never be deleted, and will leak memory. If you are parsing files in a loop, always create the HTML::TreeBuilder object, call its parse_file() method, and then call its delete() method.
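A loop written along those lines might look like the following sketch. The two input files are temporary files created by the example itself so that it can run anywhere; in a real script they would be the documents you want to process.

```perl
use strict;
use warnings;
use File::Temp ();
use HTML::TreeBuilder;

# Create two small HTML files to work on
my @files;
for my $title ('First', 'Second') {
    my $tmp = File::Temp->new(SUFFIX => '.html');
    print {$tmp} "<html><head><title>$title</title></head>",
                 "<body><p>text</p></body></html>\n";
    close $tmp;
    push @files, $tmp;    # keep the objects so the files stay on disk
}

my @titles;
for my $file (@files) {
    my $tree = HTML::TreeBuilder->new;    # keep a handle on the object
    $tree->parse_file($file->filename) or die "Couldn't parse: $!";
    my $title = $tree->look_down(_tag => 'title');
    push @titles, $title ? $title->as_text : '(untitled)';
    $tree->delete;                        # free the tree before the next pass
}
print "$_\n" for @titles;
```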

The parse tree returned by HTML::TreeBuilder is actually a very feature-rich object. You can recursively descend through its nodes to extract information from the HTML file, extract hypertext links, modify selected HTML elements, and then convert the whole thing back into printable HTML. However, the same functionality is also available in a more flexible form in the HTML::Parser class, which we cover later in this chapter. For details, see the HTML::TreeBuilder and HTML::Element POD documentation.

Returning Formatted HTML from the get_url.pl Script

We'll now rewrite get_url.pl a third time in order to take advantage of the formatting features offered by HTML::FormatText. When the new script, imaginatively christened get_url3.pl, detects an HTML document, it automatically converts it into formatted text.

The interesting feature of this script is that we combine LWP::UserAgent's request callback mechanism with the HTML::TreeBuilder parse() method to begin the parse as the HTML document is downloading. When we parallelize downloading and parsing, the script executes significantly faster. Figure 9.18 shows the code.

Figure 9.18. The get_url3.pl script

graphics/09fig18.gif

Lines 1-6: Load modules We bring in LWP, PromptUtil, HTML::FormatText, and the HTML::TreeBuilder modules.

Lines 7-11: Set up request We set up the HTTP::Request as we did in earlier iterations of this script. As before, the script is made a subclass of LWP::UserAgent so that we can override the get_basic_credentials() method and prompt the user for authentication information when required.

Lines 12-14: Send the request We send the request using the agent's request() method. However, instead of allowing LWP to leave the returned content in the HTTP::Response object for retrieval, we give request() a second argument containing a reference to the process_document() subroutine. This subroutine is responsible for parsing incoming HTML documents.

process_document() leaves the HTML parse tree, if any, in the global variable $html_tree, which we declare here. After the request() is finished, we check the status of the returned HTTP::Response object and die with an explanatory error message if the request failed for some reason.

Lines 15-20: Format and print the HTML If the requested document is HTML, then process_document() has parsed it and left the tree in $html_tree. We check to see whether the tree is nonempty. If so, we call its eof() method to tell the parser to finish, and pass the tree to a newly created HTML::FormatText object to create a formatted string that we immediately print. We are now done with the parse tree, so we call its delete() method.

As we shall see, process_document() prints all non-HTML documents immediately, so there's no need to take further action for non-HTML documents.

Lines 21-29: The process_document() subroutine LWP::UserAgent invokes callbacks with three arguments consisting of the downloaded data, the current HTTP::Response object, and an LWP::Protocol object.

We call the response object's content_type() method to get the MIME type of the incoming document. If the type is text/html, then we pass the data to the parse tree's parse() method. If necessary, we create the HTML::TreeBuilder first, using the ||= operator so that the call to HTML::TreeBuilder->new() is executed only if the $html_tree variable is undefined.

If the content type is something other than text/html, then we immediately print the data. This is a significant improvement to earlier versions of get_url.pl because it means that non-HTML data starts to appear on standard output as soon as it arrives from the remote server.
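The callback mechanism can be tried out without a network connection by pointing LWP at a local file through a file: URL. The sketch below does this with a temporary file; it is a simplified illustration of the technique, not the listing from Figure 9.18, and it leaves out authentication and formatting. It assumes that LWP's file: handler assigns the text/html type to a file whose name ends in ".html".

```perl
use strict;
use warnings;
use File::Temp ();
use LWP::UserAgent;
use HTTP::Request;
use URI::file;
use HTML::TreeBuilder;

# A local file stands in for a remote document
my $tmp = File::Temp->new(SUFFIX => '.html');
print {$tmp} "<html><head><title>Local test</title></head>",
             "<body><p>Parsed while downloading</p></body></html>\n";
close $tmp;

my $html_tree;     # filled in by the callback for HTML documents

# Invoked by LWP for each chunk of data as it arrives
sub process_document {
    my ($data, $response, $protocol) = @_;
    if ($response->content_type eq 'text/html') {
        $html_tree ||= HTML::TreeBuilder->new;   # create on first use
        $html_tree->parse($data);
    } else {
        print $data;                 # pass non-HTML data straight through
    }
}

my $ua       = LWP::UserAgent->new;
my $request  = HTTP::Request->new(GET => URI::file->new_abs($tmp->filename));
my $response = $ua->request($request, \&process_document);
die $response->status_line unless $response->is_success;

my $text = '';
if ($html_tree) {
    $html_tree->eof;                 # tell the parser to finish
    $text = $html_tree->as_text;
    print "$text\n";
    $html_tree->delete;
}
```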

Lines 30-38: The get_basic_credentials() subroutine This is the same subroutine we looked at in get_url2.pl.

This script does not check for the case in which the response does not provide a content type. Strictly speaking it should do so, as the HTTP specification allows (but strongly discourages) Web servers to omit this field. Run the script with the -w switch to detect and report this case. Useful enhancements to get_url3.pl might include using HTML::FormatPS for printing support, or adapting the script to use external viewers to display non-HTML MIME types the way we did in the pop_fetch.pl script of Chapter 8.

The HTML::Parser Module

HTML::Parser is a powerful but complex module that allows you to parse HTML and XML documents. Part of the complexity is inherent in the structure of HTML itself, and part of it is due to the fact that there are two distinct APIs for HTML::Parser, one used by version 2.2X of the module and the other used in the current 3.X series.

HTML and XML are organized around a hierarchical series of markup tags. Tags are enclosed by angle brackets and have a name and a series of attributes. For example, this tag

<img src="/icons/arrow.gif" alt="arrow">
 

has the name img and the two attributes src and alt.

In HTML, tags can be paired or unpaired. Paired tags enclose some content, which can be plain text or can contain other tags. For example, this fragment of HTML

<p>Oh dear, now the <strong>bird</strong> is gone!</p>
 

consists of a paragraph section, starting with the <p> tag and ending with its mate, the </p> tag. Between the two is a line of text, a portion of which is itself enclosed in a pair of <strong> tags (indicating strongly emphatic text). HTML and XML both constrain which tags can occur within others. For example, a <title> section, which designates some text as the title of a document, can occur only in the <head> section of an HTML document, which in turn must occur in an <html> section. See Figure 9.19 for a very minimal HTML document.

Figure 9.19. A skeletal HTML document

graphics/09fig19.gif

In addition to tags, an HTML document may contain comments, which are ignored by rendering programs. Comments begin with the characters <!-- and end with --> as in:

<!-- ignore this -->
 

HTML files may also contain markup declarations, contained within the characters <! and >. These provide meta-information to validators and parsers. The only HTML declaration you are likely to see is the <!DOCTYPE ...> declaration at the top of the file that indicates the version of HTML the document is (or claims to be) using. See the top of Figure 9.19 for an example.

Because the "<" and ">" symbols have special significance, all occurrences of these characters in proper HTML have to be escaped to the "character entities" &lt; and &gt;, respectively. The ampersand has to be escaped as well, to &amp;. Many other character entities are used to represent nonstandard symbols such as the copyright sign or the German umlaut.
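As a small illustration of the escaping rules above, the following sketch uses the HTML::Entities module, which is distributed together with HTML::Parser. The sample strings are made up for the example.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities decode_entities);

# Escape the markup-significant characters in a plain-text string.
my $plain   = 'if a < b && b > c';
my $escaped = encode_entities($plain);
print "$escaped\n";

# Translate character entities back into the characters they stand for.
# &copy; becomes the copyright sign (code point 0xA9).
my $decoded = decode_entities('&lt;p&gt; &amp; &copy;');
printf "decoded string has %d characters\n", length $decoded;
```

The decoded string is measured rather than printed because the copyright sign is a non-ASCII character and would need an output encoding layer to print cleanly.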

XML syntax is a stricter and more regular version of HTML's. Instead of allowing both paired and unpaired tags, XML requires all tags to be paired. Tag and attribute names are case sensitive (HTML's are not), and all attribute values must be enclosed in quotes. If an element is empty, meaning that there is nothing between the start and end tags, XML allows you to abbreviate this as an "empty element" tag. This is a start tag that begins with <tagname and ends with />. As an illustration of this, consider these two XML fragments, both of which have exactly the same meaning:

<img src="/icons/arrow.gif" alt="arrow"></img>
 <img src="/icons/arrow.gif" alt="arrow" />
 
Using HTML::Parser

HTML::Parser is event driven. It parses through an HTML document, starting at the top and traversing the tags and subtags in order until it reaches the end. To use it, you install handlers for events that you are interested in processing, such as encountering a start tag. Your handler will be called each time the desired event occurs.

Before we get heavily into the HTML::Parser, we'll look at a basic example. The print_links.pl script parses the HTML document presented to it on the command line or standard input, extracts all the links and images, and prints out their URLs. In the following example, we use get_url2.pl to fetch the Google search engine's home page and pipe its output to print_links.pl:

% get_url2.pl http://www.google.com | print_links.pl
 img: images/title_homepage2.gif
 link: advanced_search.html
 link: preferences.html
 link: link_NPD.html
 link: jobs.html
 link: http://directory.google.com
 link: adv/intro.html
 link: websearch_programs.html
 link: buttons.html
 link: about.html
 

Figure 9.20 shows the code for print_links.pl.

Figure 9.20. The print_links.pl script


Lines 1-3: Load modules. After turning on strict syntax checking, we load HTML::Parser. This is the only module we need.

Lines 4-5: Create and initialize the parser object. We create a new HTML::Parser object by calling its new() method. For reasons explained in the next section, we tell new() to use the version 3 API by passing it the api_version argument.

After creating the parser, we configure it by calling its handler() method to install a handler for start tag events. The start argument is paired with a reference to our print_link() subroutine; this subroutine is invoked every time the parser encounters a start tag. The third argument to handler() tells HTML::Parser what arguments to pass to our handler when it is called. We request that the parser pass print_link() the name of the tag (tagname) and a hash reference containing the tag's attributes (attr).

Lines 6-7: Parse standard input. We now call the parser's parse() method, passing it lines read via the <> operator. When we reach the end of file, we call the parser's eof() method to tell it to finish up. The parse() and eof() methods behave identically to the HTML::TreeBuilder methods we looked at earlier.

Lines 8-15: The print_link() callback. Most of the program logic occurs in print_link(). This subroutine is called during the parse every time the parser encounters a start tag. As we specified when we installed the handler, the parser passes the subroutine the name of the tag and a hash reference containing the tag's attributes. Both the tag name and all the attribute names are automatically transformed to lowercase letters, making it easier to deal with the rampant variations in case used in most HTML.

We are interested only in hypertext links, the <a> tag, and inline images, the <img> tag. If the tag name is "a", we print a line labeled "link:" followed by the contents of the href attribute. If, on the other hand, the tag name is "img", we print "img:" followed by the contents of the src attribute. For any other tag, we do nothing.
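The walkthrough above can be condensed into the following sketch. It is not the listing from Figure 9.20, so its line numbers do not match the ones quoted in the text. To keep it self-contained, it parses a short made-up document held in a string instead of reading standard input, and it collects its output in a hypothetical @found array before printing.

```perl
use strict;
use warnings;
use HTML::Parser;

my @found;   # hypothetical accumulator for "label: url" lines

# Callback for start events. The parser passes the lowercased tag name
# and a hash reference of the tag's attributes, as requested below.
sub print_link {
    my ($tagname, $attr) = @_;
    if ($tagname eq 'a' && defined $attr->{href}) {
        push @found, "link: $attr->{href}";
    }
    elsif ($tagname eq 'img' && defined $attr->{src}) {
        push @found, "img: $attr->{src}";
    }
}

my $parser = HTML::Parser->new(api_version => 3);
$parser->handler(start => \&print_link, 'tagname,attr');

# A made-up document stands in for standard input.
my $html = '<p><A HREF="about.html">About</A> '
         . '<IMG SRC="images/logo.gif" ALT="logo"></p>';
$parser->parse($html);
$parser->eof;

print "$_\n" for @found;
```

To process standard input as the script in the text does, the parse() call would become a loop along the lines of $parser->parse($_) while <>.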

The HTML::Parser API

HTML::Parser has two APIs. In the earlier API, which was used through version 2 of the module, you install handlers for various events by subclassing the module and overriding methods named start(), end(), and text(). In the current API, introduced in version 3.0 of the module, you call handler() to install event callbacks as we did in Figure 9.20.

You may still see code that uses the older API, and HTML::Parser goes to pains to maintain compatibility with the older API. In this section, however, we highlight only the most useful parts of the version 3 API. See the HTML::Parser POD documentation for more information on how to control the module's many options.

To create a new parser, call HTML::Parser->new().

$parser = HTML::Parser->new(@options)

The new() method creates a new HTML::Parser. @options is a series of option/value pairs that change various parser settings. The most used option is api_version, which can be "2" to create a version 2 parser, or "3" to create a version 3 parser. For backward compatibility, if you do not specify any options new() creates a version 2 parser.

Once the parser is created, you will call handler() one or more times to install handlers.

$parser->handler($event => \&handler, $args)

The handler() method installs a handler for a parse event. $event is the name of the event, \&handler is a reference to the callback subroutine that handles it, and $args is a string telling HTML::Parser what information about the event the subroutine wishes to receive.

The event name is one of start, end, text, comment, declaration, process, or default. The first three events are the most common. A start event is generated whenever the parser encounters a start tag, such as <strong>. An end event is triggered when the parser encounters an end tag, such as </strong>. text events are generated for the text between tags. The comment event is generated for HTML comments. declaration and process events apply primarily to XML elements. Last, the default event is a catchall for anything that is not explicitly handled elsewhere.

$args is a string containing a comma-delimited list of information that you want the parser to pass to the handler. The information will be passed as subroutine arguments in the exact order that they appear in the $args list. There are many possible arguments. Here are some of the most useful:

  • tagname: the name of the tag

  • text: the full text that triggered the event, including the markup delimiters

  • dtext: decoded text, with markup removed and entities translated

  • attr: a reference to a hash containing the tag attributes and values

  • self: a copy of the HTML::Parser object itself

  • "string": the literal string (single or double quotes required!)

For example, this call causes the get_text() handler to be invoked every time the parser processes some content text. The argument passed to the handler will be a three-element list that contains the parser object, the literal string "TEXT", and the decoded content text:

$parser->handler('text'=>\&get_text, "self,'TEXT',dtext");
 
  • tagname is most useful in conjunction with start and end events. Tags are automatically downcased, so that <UL>, <ul>, and <Ul> are all given to the handler as "ul". In the case of end tags, the "/" is suppressed, so that an end handler receives "ul" when a </ul> tag is encountered.

  • dtext is used most often in conjunction with text events. It returns the nontag content of the document, with all character entities translated to their proper values.

  • The attr hash reference is useful only with start events. If requested for other events, the hash reference will be empty.

Passing handler() a second argument of undef removes the handler for the specified event, restoring the default behavior. An empty string causes the event to be ignored entirely.
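Here is a minimal sketch of the text handler call described above, run against a one-line made-up document. It records each invocation in a hypothetical @calls array so the three arguments can be inspected afterward.

```perl
use strict;
use warnings;
use HTML::Parser;

my @calls;   # hypothetical record of each handler invocation

# Receives exactly what the argument string asks for, in the same order:
# the parser object, the literal string 'TEXT', and the decoded text.
sub get_text {
    my ($self, $label, $dtext) = @_;
    push @calls, [ ref($self), $label, $dtext ];
}

my $parser = HTML::Parser->new(api_version => 3);
$parser->handler(text => \&get_text, "self,'TEXT',dtext");
$parser->parse('<p>AT&amp;T &lt;rocks&gt;</p>');
$parser->eof;

# Print the label and decoded text of every call.
print "$_->[1]: $_->[2]\n" for @calls;
```

Because dtext was requested, the handler sees the entities &amp;, &lt;, and &gt; already translated into the characters they represent.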

$parser->handler($event =>\@array, $args)

Instead of having a subroutine invoked every time the parser triggers an event, you can have the parser fill an array with the information that would have been passed to it, then examine the array at your leisure after the parse is finished.

To do this, use an array reference as the second argument to handler(). When the parse is done, the array will contain one element for each occurrence of the specified event, and each element will be an anonymous array containing the information specified by $args.
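A short sketch of the array-reference form follows. The document and the @starts and @hrefs variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

my @starts;   # filled in by the parser, one element per start tag
my $parser = HTML::Parser->new(api_version => 3);
$parser->handler(start => \@starts, 'tagname,attr');

$parser->parse('<ul><li><a href="one.html">One</a>'
             . '<li><a href="two.html">Two</a></ul>');
$parser->eof;

# Each element is an anonymous array of [tagname, attr hashref].
# Pick out the href attribute of every <a> tag.
my @hrefs = map  { $_->[1]{href} }
            grep { $_->[0] eq 'a' } @starts;
print "$_\n" for @hrefs;
```

This style suits small documents. For large ones the callback form avoids holding every event in memory at once.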

Once initialized, you trigger the parse with parse_file() or parse().

$result = $parser->parse_file($file)

$result = $parser->parse($data)

$parser->eof

The parse_file(), parse(), and eof() methods work exactly as they do for HTML::TreeBuilder. A handler that wishes to terminate parsing early can call the parser object's eof() method.

Two methods are commonly used to tweak the parser.

$bool = $parser->unbroken_text([$bool])

When processing chunks of content text, HTML::Parser ordinarily passes them to the text handler one chunk at a time, breaking text at word boundaries. If unbroken_text() is set to a true value, this behavior changes so that all the text between two tags is passed to the handler in a single operation. This can make some pattern matches easier.
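The effect is easiest to see when a document arrives in pieces, as it does over a network connection. The sketch below feeds the same two fragments to a parser with and without unbroken_text() enabled. text_chunks() is a hypothetical helper written for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Feed the given pieces to a fresh parser and return the list of
# strings that the text handler received.
sub text_chunks {
    my ($unbroken, @pieces) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(api_version => 3);
    $parser->unbroken_text($unbroken);
    $parser->handler(text => sub { push @chunks, $_[0] }, 'dtext');
    $parser->parse($_) for @pieces;
    $parser->eof;
    return @chunks;
}

# The split falls in the middle of the word "bird".
my @pieces   = ('<p>Oh dear, now the bi', 'rd is gone!</p>');
my @default  = text_chunks(0, @pieces);
my @unbroken = text_chunks(1, @pieces);

printf "default: %d chunk(s); unbroken: %d chunk(s)\n",
    scalar @default, scalar @unbroken;
```

With unbroken_text() enabled, the handler receives the whole sentence in one call, so a pattern match against it cannot be defeated by an unlucky chunk boundary.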

$bool = $parser->xml_mode([$bool])

The xml_mode() method puts the parser into a mode compatible with XML documents. This has two major effects. First, it allows the empty element construct, <tagname/>. When the parser encounters a tag like this one, it generates two events, a start event and an end event.

Second, XML mode disables the automatic conversion of tag and attribute names into lowercase. This is because XML, unlike HTML, is case sensitive.
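Both effects can be observed with a sketch like the following, which parses a single mixed-case empty element tag with XML mode on and off. events_for() is a hypothetical helper written for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Parse $doc with XML mode on or off and return a list of
# "start NAME" / "end NAME" strings in the order they occurred.
sub events_for {
    my ($xml_mode, $doc) = @_;
    my @events;
    my $parser = HTML::Parser->new(api_version => 3);
    $parser->xml_mode($xml_mode);
    $parser->handler(start => sub { push @events, "start $_[0]" }, 'tagname');
    $parser->handler(end   => sub { push @events, "end $_[0]"   }, 'tagname');
    $parser->parse($doc);
    $parser->eof;
    return @events;
}

my $doc  = '<Image Src="/icons/arrow.gif" />';
my @xml  = events_for(1, $doc);
my @html = events_for(0, $doc);

print "xml mode:  @xml\n";
print "html mode: @html\n";
```

In XML mode the tag name keeps its capitalization and the empty element produces both a start and an end event.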

search_rfc.pl Using HTML::Parser

We'll now rewrite search_rfc.pl (Figures 9.8 and 9.10) to use HTML::Parser. Instead of using an ad hoc pattern match to find the RFC names in the search response document, we'll install handlers to detect the appropriate parts of the document, extract the needed information, and print the results.

Recall that the matching RFCs are in an ordered list (<OL>) section and have the following format:

<OL>
   <LI><A HREF="ref1">rfc name 1</A> - <STRONG>description 1</STRONG>
   <LI><A HREF="ref2">rfc name 2</A> - <STRONG>description 2</STRONG>
   ...
 </OL>
 

We want the parser to extract and print the text located within <A> and <STRONG> elements, but only those located within an <OL> section. The text from other parts of the document, even those in other <A> and <STRONG> elements, are to be ignored. The strategy that we will adopt is to have the start handler detect when an <OL> tag has been encountered, and to install a text handler to intercept and print the content of any subsequent <A> and <STRONG> elements. An end handler will detect the </OL> tag, and remove the text handler, so that other text is not printed.

Figure 9.21 shows this new version, named search_rfc3.pl.

Figure 9.21. The search_rfc3.pl script


Lines 1-5: Load modules. In addition to the LWP and HTTP::Request::Common modules, we load HTML::Parser.

Lines 6-18: Set up search. We create an LWP::UserAgent and a new HTTP::Request in the same way as in the previous incarnation of this script.

Lines 19-20: Create HTML::Parser. We create a new version 3 HTML::Parser object, and install a handler for the start event. The handler will be the start() subroutine, and it will receive a copy of the parser object and the name of the tag.

Lines 21-22: Issue request and parse. We call the user agent's request() method to process the request. As in the print_links.pl script (Figure 9.20), we use a code reference as the second argument to request() so that we can begin processing incoming data as soon as it arrives. In this case, the code reference is an anonymous subroutine that invokes the parser's parse() method.

After the request is finished, we call the parser's eof() method to have it finish up.

Line 23: Warn of error conditions. If the response object's is_success() method returns false, we die with an error message. Otherwise, we do nothing: The parser callbacks are responsible for extracting and printing the relevant information from the document.

Lines 24-31: The start() subroutine. The start() subroutine is the callback for the start event. It is called whenever the parser encounters a start tag. We begin by recovering the parser object and the tag name from the stack. We need to remember the tag later when we are processing text, so we stash it in the parser object under the key last_tag. (The HTML::Parser POD documentation informs us that the parser is a blessed hash reference, and specifically invites us to store information there in this manner.)

If the tag is anything other than "ol", we do nothing and just return. Otherwise, we install two new handlers. One is a handler for the text event. It will be passed the parser object and the decoded text. The other is a handler for the end event. Like start(), it will be passed the parser object and the name of the end tag.

Lines 32-38: The end() subroutine. The end() subroutine is the handler for the end event. It begins by resetting the last_tag key in the parser object. If the end tag isn't equal to "ol", we just return, doing nothing. Otherwise, we set both the text and the end handlers to undef, disabling them.

Lines 39-45: The extract() subroutine. extract() is the handler for the text event, and is the place where the results from the search are extracted and printed. We get a copy of the parser object and the decoded text on the subroutine call stack. After stripping whitespace from the text, we examine the value of the last_tag key stored in the parser object. If the last tag is "a", then we are in the <A> section that contains the name of the RFC. We print the text, followed by a tab. If the last tag is "strong", then we are in the section of the document that contains the title of the RFC. We print that, followed by a newline.

The new version of search_rfc.pl is more than twice as long as the original, but it adds no new features, so what good is it? In this case, a full-blown parse of the search results document is overkill. However, there will be cases when you need to parse a complex HTML document and regular expressions will become too cumbersome to use. In these cases, HTML::Parser is a life saver.
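The handler-swapping strategy described in this section can be tried without a network connection. The sketch below is not the Figure 9.21 listing: it omits the LWP request entirely, parses a made-up response held in a string, and appends results to a hypothetical $output variable before printing. The start(), end(), and extract() subroutines follow the descriptions given in the walkthrough.

```perl
use strict;
use warnings;
use HTML::Parser;

my $output = '';   # hypothetical buffer; the script in the text prints directly

# Remember every start tag; on <OL>, switch on the text and end handlers.
sub start {
    my ($parser, $tag) = @_;
    $parser->{last_tag} = $tag;
    return unless $tag eq 'ol';
    $parser->handler(text => \&extract, 'self,dtext');
    $parser->handler(end  => \&end,     'self,tagname');
}

# Forget the current tag; on </OL>, switch the two handlers off again.
sub end {
    my ($parser, $tag) = @_;
    $parser->{last_tag} = '';
    return unless $tag eq 'ol';
    $parser->handler(text => undef);
    $parser->handler(end  => undef);
}

# Emit link text followed by a tab, and <STRONG> text followed by a newline.
sub extract {
    my ($parser, $text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return unless length $text;
    my $last = $parser->{last_tag} || '';
    $output .= "$text\t" if $last eq 'a';
    $output .= "$text\n" if $last eq 'strong';
}

my $parser = HTML::Parser->new(api_version => 3);
$parser->handler(start => \&start, 'self,tagname');

# Made-up response: one link outside the list, two entries inside it.
$parser->parse(<<'END');
<a href="x">Other</a>
<OL>
  <LI><A HREF="ref1">RFC 1</A> - <STRONG>description 1</STRONG>
  <LI><A HREF="ref2">RFC 2</A> - <STRONG>description 2</STRONG>
</OL>
<strong>ignored</strong>
END
$parser->eof;

print $output;
```

The link and the emphasized text outside the list produce no output, because the text handler is installed only between the <OL> and </OL> tags.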

Extracting Images from a Remote URL

To tie all the elements of this chapter together, our last example is an application that mirrors all the images in an HTML document at a specified URL. Given a list of one or more URLs on the command line, mirror_images.pl retrieves each document, parses it to find all inline images, and then fetches the images to the current directory using the mirror() method. To keep the mirrored images up to date, this script can be run repeatedly.

As the script runs, it prints the local name for the image. For example, here's what happened when I pointed the script at http://www.yahoo.com/:

% mirror_images.pl http://www.yahoo.com
 m5v2.gif: OK
 messengerpromo.gif: OK
 sm.gif: OK
 

Running it again immediately gives three "Not Modified" messages. Figure 9.22 gives the complete code listing for the script.

Figure 9.22. The mirror_images.pl script


Lines 1-7: Load modules. We turn on strict syntax checking and load the LWP, PromptUtil, HTTP::Cookies, HTML::Parser, and URI modules. The last module is used for its ability to resolve relative URLs into absolute URLs.

Lines 8-11: Create the user agent. We again use the trick of subclassing LWP::UserAgent to override the get_basic_credentials() method. The agent is stored in a variable named $agent. Some of the remote sites we contact might require HTTP cookies, so we initialize an HTTP::Cookies object on a file in our home directory and pass it to the agent's cookie_jar() method. This allows the script to exchange cookies with the remote sites automatically.

Lines 12-15: Create the request and the parser. We enter a loop in which we shift URLs off the command line and process them. For each URL, we create a new GET request using HTTP::Request->new(), and an HTML::Parser object to parse the document as it comes in.

We install the subroutine start() as the parse handler for the start event. This handler will receive a copy of the parser object, the name of the start tag, and a hash reference containing the tag's attributes and their values.

Lines 16-24: Issue the request. We call the agent's request() method to issue the request, returning a response object. As in the last example, we provide request() with a code reference as the second argument, causing the agent to pass the incoming data to this subroutine as it arrives.

In this case, the code reference is an anonymous subroutine. We first check that the MIME type of the response is text/html. If it isn't, we die with an error message. This doesn't cause the script as a whole to die, but does abort processing of the current URL and leaves the error message in a special X-Died: field of the response header.

Otherwise, the incoming document is parseable as an HTML file. Our handler is going to need two pieces of extra information: the base URL of the current response for use in resolving relative URLs, and the user agent object so that we can issue requests for inline images. We use the same technique as in Figure 9.21, and stash this information into the parser's hash reference.

Lines 25-27: Warn of error conditions. After the request has finished, we check the response for the existence of the X-Died: header and, if it exists, issue a warning. Likewise, we print the response's status message if the is_success() method returns false.

Lines 28-37: The start() handler. The start() subroutine is invoked by the parser to handle start tags. As called for by the argument list passed to handler(), the subroutine receives a copy of the parser object, the name of the current tag, and a hash reference containing tag attributes.

We check whether we are processing an <IMG> tag. If not, we return without taking further action. We then check that the tag's src attribute is defined, and if so, copy it to a local variable.

The src attribute contains the URL of the inline image, and may be an absolute URL like http://www.yahoo.com/images/messengerpromo.gif, or a relative one like images/messengerpromo.gif. To fetch image source data, we must resolve relative URLs into absolute URLs so that we can request them via the LWP user agent. We must also construct a local filename for our copy of the image.

Absolutizing relative URLs is an easy task thanks to the URI module. The URI->new_abs() method constructs a complete URL given a relative URL and a base. We obtain the base URL of the document containing the image by retrieving the "base" key from the parser hash where we stashed it earlier. This is passed to new_abs() along with the URL of the image (line 33), obtaining an absolute URL. If the URL was already absolute, calling new_abs() doesn't hurt. The method detects this fact and passes the URL through unchanged.

Constructing the local filename is a matter of extracting the filename part of the path (line 34), using a pattern match to extract the rightmost component of the image URL.
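The two steps just described, absolutizing the URL and deriving a local filename, can be sketched as follows. image_names() is a hypothetical helper, not a subroutine from Figure 9.22. It also matches against the path portion of the URL rather than the whole URL, which is an assumption made here so that a query string cannot end up in the filename.

```perl
use strict;
use warnings;
use URI;

# Given an image's src attribute and the base URL of the enclosing
# document, return the absolute URL and a flat local filename.
sub image_names {
    my ($src, $base) = @_;
    my $remote = URI->new_abs($src, $base);

    # Rightmost path component; undefined if the path ends in a slash.
    my ($local) = $remote->path =~ m!([^/]+)$!;
    return ($remote->as_string, $local);
}

my $base = 'http://www.yahoo.com/';
for my $src ('images/messengerpromo.gif',
             'http://www.yahoo.com/images/messengerpromo.gif') {
    my ($remote, $local) = image_names($src, $base);
    print "$src => $remote ($local)\n";
}
```

Both the relative and the already absolute form resolve to the same absolute URL and the same local name.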

We now call the user agent's mirror() method to copy the remote image to our local filesystem and print the status message. Notice how we obtain a copy of the user agent from the parser hash reference. This avoids having to create a new user agent.

Lines 38-46: The get_basic_credentials() method. This is identical to earlier versions.

There is a slight flaw in mirror_images.pl as it is now written. All images are mirrored to the same directory, and no attempt is made to detect image name clashes between sites, or even within the same site when the image paths are flattened (as might occur, for example, when mirroring remote images named /images/whats_new.gif and /news/hot_news/whats_new.gif).

To make the script fully general, you might want to save each image in a separate subdirectory named after the remote hostname and the path of the image within the site. We can do this relatively painlessly by combining the URI host() and path() methods with the mkpath() and dirname() functions imported from the File::Path and File::Basename modules, respectively. The relevant section of start() would now look like this:

...
 use File::Path 'mkpath';
 use File::Basename 'dirname';
 ...
 sub start {
    ...
    my $remote_name = URI->new_abs($url,$parser->{base});
    my $local_name  = $remote_name->host . $remote_name->path;
    mkpath(dirname($local_name),0,0711);
    ...
 }
 

For the image URL http://www.yahoo.com/images/whats_new.gif, this will mirror the file into the subdirectory www.yahoo.com/images.
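The fragment above can be tried in isolation with a sketch like the one below. local_name_for() is a hypothetical helper, and the directories are created under a temporary directory from File::Temp (an addition for this example) so that running it does not litter the current directory.

```perl
use strict;
use warnings;
use URI;
use File::Path     'mkpath';
use File::Basename 'dirname';
use File::Spec;
use File::Temp     'tempdir';

# Build a local name of the form host/path for an image URL.
sub local_name_for {
    my ($url, $base) = @_;
    my $remote = URI->new_abs($url, $base);
    return $remote->host . $remote->path;
}

my $root  = tempdir(CLEANUP => 1);   # scratch area, removed on exit
my $local = local_name_for('/images/whats_new.gif', 'http://www.yahoo.com/');

# Create the per-host directory tree that will hold the mirrored file.
my $dir = File::Spec->catdir($root, dirname($local));
mkpath($dir, 0, 0711);

print "$local\n";
print "created directory\n" if -d $dir;
```

With this naming scheme, /images/whats_new.gif and /news/hot_news/whats_new.gif land in different directories and no longer clash.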

Summary

The LWP module allows you to write scripts that act as World Wide Web clients. You can retrieve Web pages, simulate the submission of fill-out forms, and easily negotiate more obscure aspects of the HTTP protocol, such as cookies and user authentication.

The HTML-Formatter and HTML-Parser modules enhance LWP by giving you the ability to format and parse HTML files. These modules allow you to transform HTML into text or PostScript for printing, and to extract interesting information from HTML files without resorting to error-prone regular expressions. As an added benefit, HTML::Parser can parse XML.

There's more to LWP than can be covered in a single book chapter. A good way to learn more about the package is to examine the lwp-request, lwp-download, and lwp-rget scripts, and other examples that come with the package.
