Web Client Programming with Perl

Chapter 5: The LWP Library
As we showed in Chapter 1, the Web works over TCP/IP, in which the client
and server establish a connection and then exchange necessary information
over that connection. The chapters "Demystifying the Browser" and
"Learning HTTP" concentrated on HTTP, the protocol spoken between web
clients and servers. Now we'll fill in the rest of the puzzle: how your
program establishes and manages the connection required for speaking
HTTP.
In writing web clients and servers in Perl, there are two approaches. You can
establish a connection manually using sockets, and then use raw HTTP; or
you can use the library modules for WWW access in Perl, otherwise known
as LWP. LWP is a set of modules for Perl 5 that encapsulate common
functions for a web client or server. Since programming with LWP is much
faster and cleaner than using sockets, this book uses it for all the examples
in later chapters, including "Example LWP Programs". If LWP is not available on your platform, see
Chapter 4, which gives more detailed descriptions of the socket calls and
examples of simple web programs using sockets.
The LWP library is available at all CPAN archives. CPAN is a collection of
Perl libraries and utilities, freely available to all. There are many CPAN
mirror sites; you should use the one closest to you, or just go to
http://www.perl.com/CPAN/ to have one chosen for you at random. LWP
was developed by a cast of thousands (well, maybe a dozen), but its primary
driving force is Gisle Aas. It is based on the libwww library developed for
Perl 4 by Roy Fielding.
Detailed discussion of each of the routines within LWP is beyond the scope
of this book. However, we'll show you how LWP can be used, and give you
a taste of it to get you started. This chapter is divided into three sections:
• First, we'll show you some very simple LWP examples, to give you an
idea of what it makes possible.
• Next, we'll list most of the useful routines within the LWP library.
• At the end of the chapter, we'll present some examples that glue

That's it. Obviously there's some error checking that you could do, but if you
just want to get your feet wet with a simple web client, this example will do.
You can call the program geturl and make it executable; for example, on
UNIX:
% chmod +x geturl
Windows NT users can use the pl2bat program, included with the Perl
distribution, to make geturl executable from the command line:
C:\your\path\here> pl2bat geturl
You can then call the program to retrieve any URL from the Web:
% geturl http://www.ora.com/
<HTML>
<HEAD>
<LINK REV=MADE HREF="mailto:[email protected]">
<TITLE>O'Reilly & Associates</TITLE>
</HEAD>
<BODY bgcolor=#ffffff>
...
Parsing HTML
Since raw HTML is hard to read as plain text, you could strip out the
HTML tags before printing it. You could try to do it manually:
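#!/bin/perl

use LWP::Simple;

# get() returns the whole document as a single string, so the loop body
# runs once with that string in $_.
foreach (get $ARGV[0]) {
    s/<[^>]*>//g;    # delete everything from each "<" to the next ">"
    print;
}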
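To see what that substitution does and does not handle, here is a minimal offline sketch that applies the same regular expression to a literal string instead of a fetched page. The helper name strip_tags and the sample markup are assumptions made up for illustration; neither is part of LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags() is a hypothetical helper, not an LWP routine. It applies the
# same substitution as the program above to a string passed in by the caller.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # delete each "<", the non-">" characters after it, and the ">"
    return $html;
}

# Made-up sample markup: a heading, a paragraph, and a character entity.
my $sample = '<H1>Catalog</H1><P>Books &amp; <B>software</B></P>';
print strip_tags($sample), "\n";
```

Run on this sample, the tags disappear, but the heading runs straight into the paragraph text and the &amp; entity is left undecoded, because the regular expression knows nothing about document structure.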
But this only does a little bit of the job. Why reinvent the wheel? There's

New and Upcoming Releases
...
Extracting Links
To find out which hyperlinks are referenced inside an HTML page, you
could go to the trouble of writing a program to search for text within angle
brackets (<...>), parse the enclosed text for the <A> or <IMG> tag, and
extract the hyperlink that appears after the HREF or SRC parameter. LWP
simplifies this process down to two function calls. Let's take the geturl
program from before and modify it:
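#!/usr/local/bin/perl
use LWP::Simple;
use HTML::Parse;
use HTML::Element;

# Fetch the page named on the command line and parse it.
$html = get $ARGV[0];
$parsed_html = HTML::Parse::parse_html($html);

# extract_links() returns a reference to an array; each entry is itself
# an array reference whose first element is the link.
for (@{ $parsed_html->extract_links( ) }) {
    $link = $_->[0];
    print "$link\n";
}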
The first change to notice is that in addition to LWP::Simple and
HTML::Parse, we added the HTML::Element class.
Then we get the document and pass it to HTML::Parse::parse_html( ). Given
HTML data, the parse_html( ) function parses the document into an internal
representation used by LWP.
$parsed_html = HTML::Parse::parse_html($html);
Here, the parse_html( ) function returns an instance of the
HTML::TreeBuilder class that contains the parsed HTML data. Since the

use LWP::Simple;
use HTML::Parse;
use HTML::Element;
use URI::URL;

$html = get $ARGV[0];
$parsed_html = HTML::Parse::parse_html($html);

for (@{ $parsed_html->extract_links( ) }) {
    $link = $_->[0];
    # Resolve the link against the URL the page was fetched from,
    # so that relative links are printed as full URLs.
    $url = new URI::URL $link;
    $full_url = $url->abs($ARGV[0]);
    print "$full_url\n";
}
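The abs( ) call is the step that turns a relative link into a full URL. The following minimal offline sketch shows what it does with a few typical cases, using the same URI::URL class and abs( ) method as the program above. The base URL and the sample links are made up for illustration; they stand in for $ARGV[0] and for whatever extract_links( ) would return.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI::URL;

# Hypothetical base URL standing in for $ARGV[0].
my $base = 'http://www.ora.com/catalog/index.html';

# Hypothetical links of the kind a page might contain.
my @links = (
    'new.html',                     # relative to the base document's directory
    '../images/logo.gif',           # relative, climbing one directory
    '/search.html',                 # relative to the server root
    'http://www.perl.com/CPAN/',    # already absolute
);

# Resolve each link against the base and collect the results as strings.
my @resolved;
for my $link (@links) {
    my $url = URI::URL->new($link);
    push @resolved, $url->abs($base)->as_string;
}

print "$_\n" for @resolved;
```

Links that are already absolute pass through unchanged, so the loop can treat every extracted link the same way.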
