Chapter 6: Example LWP Programs
This chapter presents LWP programs that are more robust and feature-rich
than the examples shown in previous chapters. While Chapter 5, The LWP
Library, focused on teaching LWP and explained how LWP objects fit
together, this chapter shows you some sample LWP programs with more
user-friendly options and features.
We present three broad categories of web client programs:
Simple clients--programs that perform actions for users in real time,
usually with a finite list of URLs to act upon. In this section, we
present LWP versions of the hcat and hgrepurl programs that were
presented in Chapter 4, The Socket Library.
Periodic clients--robots that perform a request repeatedly, with some
delay between each request. Periodic clients typically request the
same resource over and over, or a different resource in a predictable
manner. For example, a client may request 0100.gif at 1 a.m., 0200.gif
at 2 a.m., etc. A periodic client might check some data and perform an
action when a condition is met. In this section, we present a program
that periodically checks the status of a Federal Express document.
Recursive clients--robots that follow hyperlinks or other references on
an HTML page. In this section, we present a program that looks for
bad links in a web site.
The boundaries between these categories are not set in stone. It is possible to
write a periodic client that also happens to be a recursive client. Or a simple
client might become periodic if the document indicates that the page should
be refreshed every 15 minutes. We're not trying to classify all programs into
one category or another; these categories are given as a way to identify
distinct behaviors that a client may exhibit.
The examples in this chapter all use a simple command-line interface. In
Chapter 7, Graphical Examples with Perl/Tk, we have some additional
examples with a graphical interface using the Tk extension to Perl.
Simple Clients
print $the_response if ($all || $response);
# get the header data
while(<F>=~ m/^(\S+):\s+(.+)/) {
print "$1: $2\n" if ($all || $header);
}
# get the entity body
if ($all || $data) {
print while (<F>);
}
In LWP, these lines can be written as:
my $code=$response->code;
my $desc = HTTP::Status::status_message($code);
my $headers=$response->headers_as_string;
my $body = $response->content;
if ($opt_r || $all) { print "HTTP/1.0 $code $desc\n"; }
if ($opt_H || $all) { print "$headers\n"; }
if ($opt_d || $all) { print $body; }
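To try these accessors without a network connection, we can build an HTTP::Response object by hand and query it. The following is only a sketch: the header values and body are made up for illustration, and a real program would get its response from a user agent instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;
use HTTP::Status;

# Build a response by hand so the example runs offline.
# The headers and body below are invented for illustration.
my $response = HTTP::Response->new(200);
$response->header('Content-Type' => 'text/html');
$response->header('Server'       => 'ExampleServer/1.0');
$response->content("<html><body>Hello</body></html>\n");

# The same four calls shown in the text above.
my $code    = $response->code;
my $desc    = HTTP::Status::status_message($code);
my $headers = $response->headers_as_string;
my $body    = $response->content;

print "HTTP/1.0 $code $desc\n";   # response line
print "$headers\n";               # header block, then a blank line
print $body;                      # entity body
```

Note that headers_as_string( ) returns the whole header block as one string, so there is no need to loop over header lines as the sockets version did.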
In addition, we've added proxy support, since it's trivial in LWP:
my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
# If proxy server specified, define it in the User Agent object
if (defined $proxy) {
my $goterr = 0; # make sure we clear the error flag
while ($url = shift @ARGV) {
  my ($code, $desc, $headers, $body) = simple_get('GET', $url, $opt_p);
  if ($opt_r || $all) { print "HTTP/1.0 $code $desc\n"; }
  if ($opt_H || $all) { print "$headers\n"; }
  if ($opt_d || $all) { print $body; }
$goterr |= HTTP::Status::is_error($code);
}
exit($goterr);
The print_help( ) routine just prints out a usage line and a list of
command-line options:
sub print_help {
print <<"HELP";
usage: $0 [-hrHd] [-p proxy URL] URLs
-h help
-r response line only
-H HTTP header data only
-d data from entity body
-p use this proxy server
reading"
my $code=$response->code;
my $desc = HTTP::Status::status_message($code);
my $headers=$response->headers_as_string;
my $body = $response->content;
$body = $response->error_as_HTML if ($response->is_error);
return ($code, $desc, $headers, $body);
}
Within simple_get( ), an LWP::UserAgent object is created, and a proxy
server is defined for the object if one was specified to simple_get( ). A new
HTTP::Request object is created with the HTTP method and path that are
passed to simple_get( ). The request is given to UserAgent's request( )
method, and an HTTP::Response object is returned. From there,
HTTP::Response::code( ), HTTP::Response::headers_as_string( ), and
HTTP::Response::content( ) are used to extract the response information
from the HTTP::Response object.
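The steps just described can be put together roughly as follows. This is a sketch under stated assumptions, not the listing from hcat itself: the name simple_get_sketch( ), the agent string, and the use of URI to find the URL's scheme for the proxy( ) call are our own choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Status;
use URI;

# Hypothetical routine modeled on the description of simple_get( );
# the name, argument order, and agent string are assumptions.
sub simple_get_sketch {
    my ($method, $url, $proxy) = @_;

    my $ua = LWP::UserAgent->new;
    $ua->agent("hcat-sketch/0.1");

    # If a proxy was given, use it for requests with this URL's scheme.
    if (defined $proxy) {
        my $scheme = URI->new($url)->scheme;
        $ua->proxy($scheme, $proxy) if defined $scheme;
    }

    # Build the request and hand it to the user agent.
    my $request  = HTTP::Request->new($method => $url);
    my $response = $ua->request($request);

    # Pull the pieces out of the HTTP::Response object.
    my $code    = $response->code;
    my $desc    = HTTP::Status::status_message($code);
    my $headers = $response->headers_as_string;
    my $body    = $response->content;

    # On an error response, return an HTML error page as the body.
    $body = $response->error_as_HTML if $response->is_error;

    return ($code, $desc, $headers, $body);
}

# Fetch each URL named on the command line (does nothing without arguments).
for my $url (@ARGV) {
    my ($code, $desc, $headers, $body) = simple_get_sketch('GET', $url);
    print "HTTP/1.0 $code $desc\n$headers\n$body";
}
```

Because request( ) returns an HTTP::Response object even when the request fails, the caller always gets back a status code and a body it can print, and never has to handle a connection error separately.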
Hypertext Grep URLs Revisited
The code that does the HTTP request of hgrepurl looks very much like
hcat's. Instead of repeating that information, let's focus on another chunk
of code that changed from the sockets version of hgrepurl.
In Chapter 4, the raw sockets version checked the response code and then
skipped over the HTTP headers:
# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@) { return; }
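For comparison, here is a small sketch that puts the pattern-matching test next to a test on the numeric status code, which is what an HTTP::Response object hands back from its code( ) method. The two helper names are hypothetical, and the RC_OK constant comes from HTTP::Status.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Status;   # provides RC_OK and the other status constants

# Sockets-style check: match the raw status line against a pattern.
sub status_line_is_ok {
    my ($status_line) = @_;
    return $status_line =~ m@^HTTP/\d+\.\d+\s+200\s@ ? 1 : 0;
}

# LWP-style check: compare the numeric code, as returned by
# $response->code, against the RC_OK constant.
sub code_is_ok {
    my ($code) = @_;
    return $code == RC_OK() ? 1 : 0;
}

print status_line_is_ok("HTTP/1.0 200 OK\r\n"), "\n";
print status_line_is_ok("HTTP/1.0 404 Not Found\r\n"), "\n";
print code_is_ok(200), "\n";
print code_is_ok(404), "\n";
```

The numeric comparison does not depend on the layout of the status line, so it keeps working whatever HTTP version or reason phrase the server sends.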
As you can see, LWP simplified a lot of the code. Let's go over hgrepurl in a
little more detail:
#!/usr/local/bin/perl -w
use strict;
use HTTP::Status;
use HTTP::Response;
use LWP::UserAgent;
use URI::URL;
use HTML::Parse;
use vars qw($opt_h $opt_i $opt_l $opt_p);
use Getopt::Std;
my $url;
After loading all the necessary modules and declaring variables, there's the
usual command-line processing with getopts( ):
getopts('hilp:');
my $all = !($opt_i || $opt_l);  # $all=1 when -i -l not set
if ($opt_h || $#ARGV==-1) {     # print help text when -h or no args
print_help( );
exit(0);
}
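To see what getopts( ) leaves behind, the following sketch runs the same option string against a fixed argument list. It is a variation, not the hgrepurl code: the parse_options( ) helper is hypothetical, it stores options in a hash rather than in the $opt_ variables, and the proxy and site URLs are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse an argument list with hgrepurl's option
# string 'hilp:', but into a hash instead of the $opt_* globals.
sub parse_options {
    my @args = @_;
    my %opts;
    local @ARGV = @args;                  # getopts( ) works on @ARGV
    getopts('hilp:', \%opts) or return;   # empty list on a bad option

    # With neither -i nor -l given, $all is true.
    my $all = !($opts{i} || $opts{l}) ? 1 : 0;

    # Whatever getopts( ) did not consume is the list of URLs.
    return (\%opts, $all, [@ARGV]);
}

my ($opts, $all, $urls) = parse_options(
    '-i', '-p', 'http://proxy.example.com:8080/', 'http://www.example.com/'
);
print "all=$all proxy=$opts->{p} urls=@$urls\n";
```

The colon after p in the option string is what tells getopts( ) that -p takes an argument. Without it, the proxy URL would be left in @ARGV and treated as a URL to fetch.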
Any remaining command-line arguments are treated as URLs and passed to
get_html( ):
while ($url = shift @ARGV) {
my ($code, $type, $data) = get_html($url,