Web Client Programming with Perl-Chapter 6: Example LWP Programs-P2 - Pdf 72

Chapter 6: Example LWP Programs-P2

The scan( ) method then does all the real work. It accepts a URL as a
parameter. In a nutshell, here's what happens:
The scan( ) method pushes the first URL onto a queue. For each URL pulled
from the queue, the links on that page are extracted and pushed onto the
queue. To keep track of which URLs have already been visited (so that they
are not pushed back onto the queue), we use an associative array called
%touched and associate every visited URL with a value of 1. Other variables
track which document points to what, the content-type of each document,
which links are bad, which links are local, which links are remote, etc.
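The queue-and-%touched bookkeeping described above can be sketched without any network access. This is only an illustration, not the CheckSite code: the %links_on table, the example.com URLs, and the crawl_order( ) helper are all made up here, and a real checker would fetch each page and extract its links instead of looking them up in a hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page maps to the links found on it.
# A real checker would fetch each page and extract its links instead.
my %links_on = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/', 'http://example.com/b'],
    'http://example.com/b' => ['http://example.com/c'],
    'http://example.com/c' => [],
);

# Walk the link graph breadth-first and return the URLs in visit order.
sub crawl_order {
    my ($root_url, $graph) = @_;
    my @urls    = ($root_url);   # queue of URLs still to look at
    my %touched = ();            # $touched{$url} = 1 once a URL is visited
    my @order;

    while (@urls) {
        my $url = shift @urls;   # pull the oldest URL off the queue
        next if $touched{$url};  # it may have been queued more than once
        $touched{$url} = 1;
        push @order, $url;

        # queue every link on this page that has not been visited yet
        foreach my $link (@{ $graph->{$url} || [] }) {
            push @urls, $link unless $touched{$link};
        }
    }
    return @order;
}

print "$_\n" for crawl_order('http://example.com/', \%links_on);
```

Each page is printed exactly once, even though /b is linked from two pages and /a links back to the root, because %touched is checked both before queueing a link and again when a URL is pulled off the queue.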
For a more detailed look at how this works, let's step through it.
First, the initial URL is pushed onto a queue:
push (@urls , $root_url);
The URL is then checked with a HEAD method. If we can determine that
the URL is not an HTML document, we can skip it. Otherwise, we follow
that with a GET method to get the HTML:
my $request = new HTTP::Request('HEAD', $url);
my $response = $self->{'ua'}->request($request);

# if not HTML, don't bother to search it for URLs
next if ($response->header('Content-Type') !~
         m@text/html@ );

# it is text/html, get the entity-body this time
$request->method('GET');
$response = $self->{'ua'}->request($request);
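The content-type test in the HEAD step can be isolated into a small helper. The sketch below is an assumption, not part of CheckSite: is_html( ) is a hypothetical name, and it is slightly stricter than the pattern in the listing, since it anchors the match at the start of the value and treats a missing header as "not HTML" (matching an undefined value against a pattern would produce a warning under -w).

```perl
use strict;
use warnings;

# Return 1 if a Content-Type header value names an HTML document, else 0.
# Handles a missing header (undef) and parameters such as "; charset=...".
sub is_html {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m{^\s*text/html\b}i ? 1 : 0;
}

# Try a few representative header values, including a missing header.
for my $type ('text/html', 'text/html; charset=iso-8859-1', 'image/gif', undef) {
    my $shown = defined $type ? $type : '(no header)';
    printf "%-32s %s\n", $shown, is_html($type) ? 'scan for links' : 'skip';
}
```

In the scan loop, a helper like this would take the value of $response->header('Content-Type') and decide whether the follow-up GET is worth making.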
Then we extract the links from the HTML page. Here, we use our own
function to extract the links. There is a similar function in the LWP library

contains the content-type for the URL. And finally, the ref( ) method returns
an associative array of URLs with values of referring URLs, delimited by \n.
So if the entry for "www.ora.com" holds "a.ora.com" and "b.ora.com" (stored
as "a.ora.com\nb.ora.com\n"), that means "a.ora.com" and "b.ora.com" both
point to "www.ora.com".
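To make the newline-delimited layout concrete, here is a minimal sketch of how such a hash might be filled and read back. The add_ref( ) and referrers( ) helpers are hypothetical names introduced for illustration and are not CheckSite methods; the duplicate check is also an assumption about how one might want it to behave.

```perl
use strict;
use warnings;

# Record that $referrer links to $url, keeping the "URL\nURL\n" layout.
sub add_ref {
    my ($ref, $url, $referrer) = @_;
    $ref->{$url} = '' unless defined $ref->{$url};
    # skip a referrer that is already listed for this URL
    return if grep { $_ eq $referrer } split /\n/, $ref->{$url};
    $ref->{$url} .= "$referrer\n";
    return;
}

# Return the referring URLs for $url as a list.
sub referrers {
    my ($ref, $url) = @_;
    return () unless defined $ref->{$url};
    return split /\n/, $ref->{$url};
}

my %ref;
add_ref(\%ref, 'www.ora.com', 'a.ora.com');
add_ref(\%ref, 'www.ora.com', 'b.ora.com');
add_ref(\%ref, 'www.ora.com', 'a.ora.com');   # duplicate, ignored

# prints "www.ora.com is referenced by: a.ora.com, b.ora.com"
print "www.ora.com is referenced by: ",
      join(', ', referrers(\%ref, 'www.ora.com')), "\n";
```

Storing the referrers as one string keeps the hash flat, at the cost of a split whenever the individual URLs are needed.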
Here's the complete source of the CheckSite package, with some sample
code around it to read in command-line arguments and print out the results:
#!/usr/local/bin/perl -w
use strict;

use vars qw($opt_a $opt_v $opt_l $opt_r $opt_R
$opt_n $opt_b
$opt_h $opt_m $opt_p $opt_e $opt_d);
use Getopt::Std;

# Important variables
#----------------------------
# @lookat    queue of URLs to look at
# %local     $local{$URL}=1  (local URLs in associative array)
# %remote    $remote{$URL}=1  (remote URLs in associative array)
# %ref       $ref{$URL}="URL\nURL\n"  (list of URLs separated by \n)
# %touched   $touched{$URL}=1  (URLs that have been visited)
# %notweb    $notweb{$URL}=1 if URL is non-HTTP
# %badlist   $badlist{$URL}="reason"  (URLs that failed. Separated with \n)

if ($opt_e) {$email=$opt_e;}
if (defined $opt_d) {$delay=$opt_d;}
if ($opt_a) {
    $print_local=$print_remote=$print_ref=$print_not_web=$print_bad = 1;
}

my $root_url=shift @ARGV;

# if there's no URL to start with, tell the user
unless ($root_url) {
    print "Error: need URL to start with\n";
    exit(-1);
}

# if no "output" options are selected, make "print_bad" the default
if (!($print_local || $print_remote || $print_ref ||
      $print_not_web || $print_bad)) {
    $print_bad=1;
}

# create CheckSite object and tell it to scan the site
my $site = new CheckSite($email, $delay, $max,
$verbose, $proxy);
$site->scan($root_url);

foreach $url (keys %notweb) {
print "notweb: $url\n";
}
}

# print reference list (what URL points to what)
if ($print_ref) {
    my $refer_by;
    my %ref = $site->ref;

    print "\nReference information:\n";
    while (($url,$refer_by) = each %ref) {
        print "\nref: $url is referenced by:\n";
        $refer_by =~ s/\n/\n  /g; # insert two spaces after each \n
        print "  $refer_by";
    }
}

# print out bad URLs, the server response line, and the Referer
if ($print_bad) {
    my $reason;
    my $refer_by;
    my %bad = $site->bad;
    my %ref = $site->ref;

    print "\nThe following links are bad:\n";
    while (($url,$reason) = each %bad) {
        print "\nbad: $url Reason: $reason";
