MARCH 2004
VOLUME III - ISSUE 3
www.phparch.com
The Magazine For PHP Professionals
Plus:
Tips & Tricks, Security Corner, Product Reviews and much more...
Explore your HTML code with Tidy
Testing Automation With PHP
Using the Amazon.com API
through PHP and XML-RPC
PHP And WAP: Past, Present & Future
Matchmaker, Matchmaker
Make Me a Match
PHP Ahoy!
A Look at: php
Cruise
Bahamas 2004
5 Editorial
6 What’s New!
34 Book Review
Flash MX 2004 for Rich Internet
Applications
42 Product Review
Mambo Open Source: Content Management
System
59 Security Corner
Shared Hosting
by Chris Shiflett
63 Tips & Tricks
By John W. Holmes
66 exit(0);
TABLE OF CONTENTS
INDEX
php|architect
Features
Departments
EDITORIAL RANTS
php|architect
Volume III - Issue 3
March, 2004
Publisher
Marco Tabini
Editorial Team
Arbi Arzoumani
Peter MacIntyre
Eddie Peloke
Graphics & Layout
Arbi Arzoumani
Managing Editor
Emanuela Corso
Director of Marketing
J. Scott Johnson
Account Executive
Shelley Johnston
Authors
John Coggeshall, John Holmes,
Dr. James McCaffrey, George Schlossnagle, Alessandro
Sfondrini, Chris Shiflett, Andrea Trasatti
php|architect (ISSN 1709-7169) is published twelve times a year by Marco Tabini &
Associates, Inc., P.O. Box 54526, 1771 Avenue Road, Toronto, ON M5M 4N5, Canada.
Although all possible care has been placed in assuring the accuracy of the contents of this
magazine, including all associated source code, listings and figures, the publisher assumes
The one I'm most proud of is George Schlossnagle's
regular expressions article. Regexes are something that
pretty much every programmer has to deal with, but
that very few among us really know how to use. In fact,
I've seen developers write extremely complicated code
with the explicit purpose of getting around having to
use a regular expression—and that is just plain wrong.
After all, using the best solution for each problem is
what being a programmer is all about.
Thus, I approached George about writing an article
on regular expressions—and it became quickly evident
that one article would not even come close to covering
the complexity of regex. Now, everyone knows that I
always try my best to stay away from multi-part articles
for a multitude of reasons, but in this case I felt that the
topic more than deserved our attention over multiple
issues and, therefore, George's article is the first in a
series of three. Over the next three months, he will take
you for a ride from the basics (which are covered in this
issue) to the more complex and exotic aspects of regu-
lar expressions, thus hopefully providing the PHP world
with a definitive guide to this topic.
If regular expressions are not your bag, one of the
other topics covered in this month's issue is certain to
tickle your fancy. For example, you may want to read
Alessandro Sfondrini's excellent article on using the
Amazon.com API directly from your PHP website, or
Andrea Trasatti's look at the world of WAP. As you can
probably imagine, both Andrea and Alessandro hail
from my native Italy—and that alone makes their arti-
).
• SQLite has been bundled with PHP. For more
information on SQLite, please visit their web-
site.
• A new SimpleXML extension for easily access-
ing and manipulating XML as PHP objects. It
can also interface with the DOM extension
and vice-versa.
• Streams have been greatly improved, includ-
ing the ability to access low-level socket oper-
ations on streams.
PHP.net also announced the release of PHP 4.3.5 RC
3. This will be the last release candidate prior to the
final release, so please test it as much as possible.
For more information visit http://www.php.net/.
ZEND Optimizer 2.5.1
Zend has announced the release of Zend Optimizer
2.5.1.
Zend.com describes the Optimizer as: "a free applica-
tion that runs the files encoded by the Zend Encoder
and Zend SafeGuard Suite, while enhancing the run-
ning speed of PHP applications.
Benefits:
• Enables users to run files encoded by the Zend
Encoder
• Increases runtime performance up to 40%."
Get more information from Zend.com
more.
For more information visit: http://dev-wms.sourceforge.net/.
phpMyAdmin 2.5.6
Phpmyadmin.net has released the latest version of
phpMyAdmin, a tool written in PHP
intended to handle the administration of MySQL over
the Web.
"Welcome to this new version, aimed at stabilization of
the 2.5 branch. Meanwhile, work is continuing on the new
2.6 branch. PhpMyAdmin is a tool written in PHP intend-
ed to handle the administration of MySQL over the Web.
Currently it can create and drop databases,
create/drop/alter tables, delete/edit/add fields, execute
any SQL statement, manage keys on fields."
For more information visit: www.phpmyadmin.net.
PhpSQLiteAdmin 0.2
PhpSQLiteAdmin is a Web interface for the administra-
tion of SQLite databases.
Version 0.2 comes with some new features and a lot
of internal cleanups and refactoring. PhpSQLiteAdmin
is still in an early stage of development. It comes free of
charge and without warranty.
For more information visit: www.phpsqliteadmin.net.
Several parts of the documentation were updated. A lot
of new language files were added and updated.
For more information visit: http://platon.sk/projects/phpMyEdit/.
ionCube Releases New Encoder
UK-based ionCube has released a new version of their
compiled code PHP encoding tools. New features
include a choice of ASCII or binary encoded file formats
and optional support for OpenSource extensions such
as mmcache.
Prices start at a special price of $159 in their March
20% off sale.
For further information, please visit the homepage of
the Encoder:
http://www.ioncube.com/sa_encoder.php
March 2004
●
PHP Architect
●
www.phparch.com
oped using PHP, serious testing processes are going to
become an integral part of every good developer's
arsenal of programming tools. What we never quite
considered is that PHP is a great testing platform even
for those projects that are not written using it.
Thankfully, James McCaffrey came to the rescue and
provided us with a wonderful article on the subject.
which appeared in the January issue of php|a, I
showed you what SOAP is and how it can be used
together with PHP. We used a SOAP-encoded docu-
ment to perform a search using the Google Engine,
then we parsed the response to display the results on
our website. To perform these operations, we wrote an
application from scratch; this approach can be great to
understand how SOAP works, but when a customer
asks you to implement a SOAP-based feature in an
application, you can't waste your time in that way.
In this case, there are some libraries that will make
your coding quicker and easier: one of these is
NuSOAP, which allows you to send Remote Procedure
Calls (RPCs) over HTTP.
This article will show you how we can use the
Amazon.com API with NuSOAP to perform searches
and display product details, without having to sort
through a lot of SOAP syntax: if you have had an
opportunity to read my previous article, you will notice
how much shorter an application written this way is,
and how much time can actually be saved by using this
method.
What are Amazon Web Services?
Amazon.com is one of the most widely known on-line
shops. You can find and buy almost everything, from
books to toys to power tools. Several years ago,
Amazon launched a very successful affiliate program,
which they later expanded in their Web Services pro-
gram.
Why would you want to use Amazon Web Services
FEATURE
Connecting to Amazon.com
Web Services with NuSOAP
by Alessandro Sfondrini
REQUIREMENTS
PHP: 4.1 and higher
OS: Any
Other software: NuSOAP 0.6.4
Code Directory: webs-nusoap
Have you ever wanted to add an online shop to your
website but gave up on the idea because you lack the
expertise and resources to run it? Using SOAP, you can
connect to Amazon Web Services and create a PHP appli-
cation to remotely browse and search products, add
them to Amazon shopping carts or wish lists and, yes,
you can even earn money on every purchase performed
from your site.
fy each purchase sent through our website.
Getting started
Before we start coding, I recommend you download
the AWS Software Developer's Kit from
http://www.amazon.com/gp/browse.html/?node=3434641.
It contains the License Agreement, a guide (you should
have a look at it to familiarize yourself with the concepts
associated with the program) and some code samples,
including a few written in PHP!
As I mentioned earlier, you will also have to apply for
your Developer's token-an alphanumerical string need-
ciative array. We then create a new soapclient object,
passing two arguments to the constructor: the SOAP
server address and a boolean value that indicates
whether the server uses a WSDL document. WSDL
(Web Services Description Language) documents contain
information about a web service, as well as its
methods and properties. They are often used by web
service providers, including Amazon.
Once we have created the object, all we have to do
is to actually execute the RPC by invoking the call()
method and specifying the remote method name and
the parameters to be passed (contained in $params in
our case). NuSOAP automatically fetches the results of
the call and stores them in the $result array.
Since we are working with a WSDL-based server,
NuSOAP can actually create a "proxy" PHP class capable
of providing a better interface to our scripts. Once
we have instantiated $s, we can also invoke a remote
method in this way:
$proxy = $s -> getProxy();
vhs, etc.). You can find a complete list of all the
IDs available in the AWS documentation.
tag String
Your Associate ID. If you don't have one, you can use
the generic ID webservices-20.
type String
Determines the type of search results. Lite indicates a
simpler result set, while heavy provides a richer set of
information about each item returned. We'll use lite
for our example.
devtag String
The Developer Token you have received from Amazon.
Figure 2
Result Datum / Type / Description
Url String
The URL of the product page for
20.55")
OurPrice String
The product's selling price on Amazon, including the
currency symbol
UsedPrice String
The product's price for used copies.
This can be useful to simplify our code: first, we create
a proxy client, $proxy; any subsequent RPCs to
methods specified in the WSDL can be performed using
the proxy, without having to use the NuSOAP call()
method again. In our application, we will use proxies to
work with AWS.
Designing the application
Now that we've laid down some ground rules, it's time
to decide in detail what the goals of our application are
going to be. Since we're all PHP fans, our example web-
site will be about PHP and, therefore, we'll want to
allow our users to buy books on this topic from
Amazon.
The first thing that we need is a search page: users
will be able to search for a particular keyword (or for a
set of keywords) and the page will display some basic
information about each book that matches the criteria,
such as its title, an image, the publishing company,
author or authors and price. We also have to provide a
of data about each search result matching
our search criteria that is included in the
page we have requested. Given that a search
only returns a maximum of ten items per
page, you can expect that this array will
contain no more than ten elements. The
lite search mode returns the data shown in
Figure 2.
<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']) ?>" method="GET">
<input type="text" name="keyword" value="" />
<input type="hidden" name="page" value="1" />
<input type="submit" name="button" value="Search!" />
</form>
<?php
if (empty($_GET["keyword"])) // If the form hasn't been submitted
    exit; // Stops the execution

require("nusoap.php");

// The WSDL address is missing from this copy of the listing;
// 'YOUR-WSDL-URL' is a placeholder, not the real address.
$client = new soapclient('YOUR-WSDL-URL', true);
$proxy = $client -> getProxy(); // Creates a WSDL client and a proxy
// [...]
?>
Listing 1
As you can see, the KeywordSearchRequest() method
returns quite a few pieces of information for every
result item, although, of course, we don't have to output
all of them on our site. If you look at Listing 1 (the
source for our search page), you'll see that the very first
part of the file is nothing more than a simple HTML
form, which contains an input text box for the keyword
and a hidden field that forces the page number to 1;
this way, a new search will automatically start from the
first page of results.
The form uses the GET method because we need to
use links for the "Next Page" and "Previous Page" operations
(something like page.php?keyword=blah&page=2).
Naturally, you could also use POST, but in that case it
would be much more difficult for someone to create a
direct link to your search results, which could, in theory,
prevent you from completing some sales.
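The "Next Page" and "Previous Page" links described above could be built along the following lines. This is only a sketch: the helper name search_page_link() is hypothetical, the script name page.php is taken from the example URL in the text, and the code is written for a current PHP release rather than the PHP 4.1 the article targets.

```php
<?php
// Hypothetical helper (not part of the article's listing): builds a link
// to a given page of results for a keyword.
function search_page_link(string $keyword, int $page): string
{
    // http_build_query() URL-encodes the values; the separator is passed
    // explicitly so the result does not depend on php.ini settings.
    $query = http_build_query(['keyword' => $keyword, 'page' => $page], '', '&');
    return 'page.php?' . $query;
}

$keyword = 'php xml';
$page = 2;

// Only offer "Previous Page" when we are past the first page.
$previous = $page > 1 ? search_page_link($keyword, $page - 1) : null;
$next = search_page_link($keyword, $page + 1);

// Escape the URL before placing it in an HTML attribute.
if ($previous !== null) {
    echo '<a href="' . htmlspecialchars($previous) . '">Previous Page</a>', "\n";
}
echo '<a href="' . htmlspecialchars($next) . '">Next Page</a>', "\n";
```

A complete page would also compare the current page against the total page count before printing the "Next Page" link.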
The second part of the script contains the actual PHP
code. First of all, an if statement stops the
execution of the script if $_GET["keyword"] is empty.
Otherwise, we include NuSOAP and create a SOAP
client by passing the URI of the
in a format that is comprehensible to the user. First, we
check whether there are any results to begin with. If
the search returned no data, the program displays a
warning and exits. Otherwise, we print a short summary
of the search: the keyword, the current page number
and total page count, followed by details about
each product in the current result page. These are actually
produced by a simple foreach loop, which browses
the $results["Details"] array, echoing the title of
each book, a medium-size image, its authors, publishing
company and prices. We will also provide a link to
another page, details.php, which contains further
information on each book. The link contains a reference
to the product's ASIN (the Amazon identifier for
each product) in order to make the application able to
retrieve the correct product from Amazon's catalogue
with another RPC.
The last part of this page allows the user to browse
the results: if the current page isn't the first one (Page
Lists Array of Strings
The names of the ListMania lists
that contain the product
BrowseList Array of Arrays
Indicates the product categories in
which the product can be found. Its
contents look like this:
BrowseList =>
Array
(
    [0] => Array
    (
        BrowseName => PHP
    )
)
Media String
The type of medium on which the
product is distributed (e.g.:
paperback or hardcover for books)
Isbn String
The ISBN code of the product (books
only)
Availability String
Indicates how long the product
takes to be shipped
Reviews Array
lar book. The AWS method we need in this case is
AsinSearchRequest(), which needs the parameters
shown in Figure 4. Just like before, the response that we
get back from Amazon is an array of arrays, except
that, in this case, we will simply concern ourselves with
the first result set, since the ASIN uniquely identifies
one product. Our data, therefore, will be stored in
$results['Details'][0], which, in turn, will contain
the information shown in Figure 5. As you can see,
some of the values returned are the same as the results
of the KeywordSearchRequest() call that we used in
Listing 1, while some others, like the customer reviews,
are more appropriate for a detailed product page.
Speaking of the product page, Listing 2 contains the
code for details.php. First, we check $_GET["asin"]; if
it is empty, the program displays a warning and exits.
In a more complete application, you may want a slightly
more verbose explanation of what went wrong, or
perhaps an automatic redirection to the search page.
If we have an ASIN, we include the NuSOAP library,
then create a SOAP client and proxy as we did in the
previous page. Please note that we have to use
<?php
if(empty($_GET["asin"]))
    die("<h3>No ASIN specified</h3>");

require("nusoap.php");
$_GET["asin"] = sprintf("%010d", $_GET["asin"]);

// The WSDL address is missing from this copy of the listing;
// 'YOUR-WSDL-URL' is a placeholder, not the real address.
$client = new soapclient('YOUR-WSDL-URL', true);
$proxy = $client -> getProxy(); // Creates a WSDL client and a proxy

$param = array(
    'asin' => $_GET["asin"],
    'tag' => 'webservices-20',
    'type' => 'heavy',
    'devtag' => 'YOUR-DEV-TOKEN'
);

$results = $proxy -> AsinSearchRequest($param); // Calls the method
?>
<h1><?=$results["Details"][0]["ProductName"] ?></h1>
<img src="<?=$results["Details"][0]["ImageUrlLarge"] ?>" align="left" height="350" />
<b>Authors:</b> <?=@implode(', ', $results["Details"][0]["Authors"])?><br /><br />
<b>Published by</b> <?=$results["Details"][0]["Manufacturer"]?>
<b> on</b> <?=$results["Details"][0]["ReleaseDate"]?><br /><br />
<b>List Price</b>: <?=$results["Details"][0]["ListPrice"] ?> -
As you have probably noticed, writing a SOAP-based
application using a library like NuSOAP is much faster
than developing your own SOAP classes; if you have
read my article about the Google API that appeared in
the January issue of php|a, you probably know what I
am talking about. This means that you can develop
rather complex applications without having to waste
time dealing with the nitty-gritty details of the underlying
protocol; in fact, we didn't even write any SOAP
code for our Amazon application: NuSOAP did it all for
us.
Naturally, the code that I have introduced here is very
basic and could stand to gain from some improvements.
For instance, Amazon Web Services allow you
to manage a remote shopping cart or wish list by
adding items to and removing items from them. The very last part
of the purchase—the one where money changes
hands—must still take place on Amazon.com, but you
can let the user perform most of the normal operations
associated with an e-commerce website without leav-
ing your website. However, do keep in mind that if you
choose to manage the user's shopping cart remotely,
you can't change it once you've submitted to
Amazon—this is done to protect the end user from
fraudulent transactions. You can check out the AWS
documentation for more details on this topic—you'll
find that it's not complicated at all.
Depending on your needs, you may choose to per-
form a different kind of search operation on your web-
site: by similar products, by author, by ISBN, by manu-
can be set to uk, de or jp, depending on which Amazon
website you are referring to.
I'm Outta Here
Amazon.com Web Services is a powerful tool that you
can use to add e-commerce functionality to your site
without going to the expense of developing an online
store of your own and stocking all the merchandise.
Even if you can't create a complete on-line shop using
AWS (because the purchase must be completed on the
Amazon website), you can still give your users a customized
shopping experience that relies on the practically
limitless resources of one of the world's most popular
e-commerce websites.
The sample application that I showed you in this arti-
cle is quite simple: if you plan to use it in a production
environment—especially if your site has a lot of traffic—
you should probably consider implementing features
like error handling and caching in order to prevent
problems with the Amazon servers. Adding these ele-
ments to your application may require some extra
work, but it could all pay off if you enjoy decent traffic
has already written some on-line PHP tutorials and published scripts on
most important Italian web portals. You can contact him at
giu_ale2@hotmail.com.
Regular expressions (commonly known as regexes)
are a powerful tool for pattern matching and text
manipulation. A typical problem that pulls people
into learning regular expressions is text munging: you
have a string of text and you need to replace portions
of it based on certain rules. For instance, you
might want to obfuscate all the email addresses
in a block of text so that email addresses like
george@example.com get translated to the form
george [at] example [dot] com.
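As a rough illustration of the obfuscation task just described, the sketch below rewrites anything that looks like an address. The helper name obfuscate_emails() and the deliberately loose pattern are assumptions made for this example (it is not a full address validator), and the code targets a current PHP release.

```php
<?php
// Hypothetical helper: rewrites anything that looks like user@host.tld.
function obfuscate_emails(string $text): string
{
    // Group 1: the local part; group 2: the dotted domain name.
    $pattern = '/\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)\b/';

    return preg_replace_callback($pattern, function (array $m): string {
        // Replace every dot in the domain, then join the two halves.
        return $m[1] . ' [at] ' . str_replace('.', ' [dot] ', $m[2]);
    }, $text);
}

echo obfuscate_emails('Write to george@example.com today.'), "\n";
```

A callback is used here because the number of dots in the domain is not known in advance.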
Applications: N/A
Code Directory: match-regex
REQUIREMENTS
A quick search for the words "hate" and "regular expres-
sions" on your favourite search engine is likely to bring up
thousands upon thousands of hits. While most developers
recognize the usefulness of regular expressions (and many
can't do without them once they have figured out how
regexes work), their use remains something of a black-
magic art—right up there with hypnosis and session man-
agement. Despite looking complicated, however, regular
expressions are much easier to work with than most peo-
ple are willing to admit.
Before we get started, we should dispel a
few popular myths about regexes:
Myth: Regular Expressions are Slow.
Truth: Regular expressions can be slow,
but they don't need to be. The main regular
expression library used by PHP (called
PCRE and consisting of the preg_ family of
functions) is quite fast and also quite
powerful. This power means that it is
easy to write a short regular expression
that performs a lot of work, and performing
a lot of work with any tool can be
slow.
Myth: You should use basic string func-
tions instead of regular expressions.
Despite its simplicity, this example illustrates the
basic syntax of a regex match. The regex itself is the
first parameter, and is contained within slashes (/).
The second parameter is the text you want to test
the pattern against. The preg_match function returns
1 (which PHP treats as true) if the match succeeds, and
0 (which PHP treats as false) if it fails. Using
slashes to delimit regular expressions is a convention
(taken from the UNIX utility awk), but is not necessary:
you can actually use any non-alphanumeric,
non-whitespace character other than a backslash.
Alternative delimiters are convenient if
your pattern itself contains slashes.
For instance, when dealing with file
paths or URLs (both of which contain
numerous slashes), it is common
to use a different delimiter.
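To make the point about delimiters concrete, here is a small sketch that matches the same URL prefix twice, once with slashes and once with # as the delimiter. The URL is an invented example.

```php
<?php
$url = 'http://www.example.com/docs/index.html';

// With slash delimiters, every slash inside the pattern must be escaped.
$withSlashes = preg_match('/http:\/\/www\.example\.com\/docs\//', $url);

// With # as the delimiter, the slashes can be written as-is.
$withHashes = preg_match('#http://www\.example\.com/docs/#', $url);

// Both patterns describe the same text, so both calls agree.
var_dump($withSlashes === $withHashes);
```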
We can also perform substitutions
with PCREs. To substitute 'george at
nospam.example.com' for my address
(a common anti-spam technique), you
can use
preg_replace("/george@example\.com/", "george [at] nospam.example.com", $text);
cle.
• preg_replace_callback: This function
makes it possible to perform very complex
operations on a per-match basis through
the use of callback functions. We will cover
it in a future article, but some of its functionality
overlaps with evaluated replacements,
which are discussed in this article.
• preg_quote(string text): When using input
text in a pattern, you may want to sanitize it
to ensure it does not contain any regex
metacharacters. preg_quote escapes all regex
metacharacters in a string.
• preg_split(string pattern, string subject
[, int limit [, int flags]]): preg_split
performs similarly to explode, allowing us to
break up the string subject into
• Grouping: Grouping allows for changing
the precedence of operations as well as
providing a means to extract the text you
matched with a pattern.
• Enumerations: Enumerators allow you to
specify how many times a character class or
sub-pattern appears. This allows for convenient
expression of fixed-length patterns like
'a US zipcode is 5 digits' as well as variable-length
patterns such as 'a domain is a number
of alphanumeric characters separated by
"The power of regular expressions is in matching complex
patterns that cannot be identified using straightforward
text-search functions like strstr()."
match this pattern, you could use the following regular
expression:
/\d\d\d-\d\d\d-\d\d\d\d/
The \d specifier is a built-in PCRE character class
that consists of all the digits. There are a couple of
things you should note about the pattern above. The
first is that we have many \d's. In regular expressions,
any character or character class matches only
a single character unless you use an enumerator
(which we'll cover later) to attach a quantity to it.
Second, if you test this pattern you will find the following
results.
• 555-123-4567 matches. This is correct.
• 5555-123-45678 matches. This is not cor-
rect.
The second example does not represent a valid
phone number (the area code and line number are too
long), but it matches because the pattern fits as shown
in Figure 1.
There are a couple of ways to combat this problem.
If you know that your search text should be exactly a
phone number (with no leading or trailing text), you
can use positional anchors to force the pattern to start
at the beginning of the text and end at the end, as we'll
see later on.
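The difference can be checked with a short sketch. The anchored pattern below uses the standard ^ (start of text) and $ (end of text) anchors; treat it as one possible way to apply the idea, not necessarily the form the article goes on to use.

```php
<?php
$loose    = '/\d\d\d-\d\d\d-\d\d\d\d/';
$anchored = '/^\d\d\d-\d\d\d-\d\d\d\d$/';

// The loose pattern accepts both candidates; the anchored one rejects
// the candidate with extra digits at either end.
foreach (['555-123-4567', '5555-123-45678'] as $candidate) {
    printf(
        "%-15s loose: %d  anchored: %d\n",
        $candidate,
        preg_match($loose, $candidate),
        preg_match($anchored, $candidate)
    );
}
```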
If the phone number might be contained in text, on
Continuing the testing, we find that "077-xxx-yyyy"
matches. US and Canadian area codes and exchanges
cannot begin with 0 or 1 (these are reserved for long
distance and operator-assisted or international services).
To be able to restrict the leading numbers to the
allowed set, we need to be able to create our own
character classes. In PCRE, these are constructed by
filling a set of brackets ([ ]) with the characters we
want to match. To match 2-9, we can use the character
class [23456789], which is commonly shortened via
a range operator to [2-9]. To use a custom character
class in a pattern, you use it exactly as you would a
regular character or character class. Here is the phone
number pattern reworked to employ this:
/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/
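A quick way to convince yourself that the reworked pattern behaves as intended is to run it over a few sample numbers. The numbers below are invented for this sketch.

```php
<?php
$pattern = '/\b[2-9]\d\d-[2-9]\d\d-\d\d\d\d\b/';

$samples = [
    '877-555-1212', // valid area code and exchange
    '077-555-1212', // area code starts with 0
    '877-155-1212', // exchange starts with 1
];

foreach ($samples as $sample) {
    $verdict = preg_match($pattern, $sample) ? 'matches' : 'does not match';
    echo "$sample $verdict\n";
}
```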
Regex doesn't always work the way you expect
' and '[ ]' have special meanings in custom
character classes, if you want those actual characters
to be elements of the class, you should escape
them with a backslash (\). The two exceptions are
the range operator -, which can appear un-escaped
as the last character in a class, since that is unambiguous,
and the negation character ^, which can
appear un-escaped in any position but the first.
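The escaping rules above can be tried out with a handful of one-character classes. The patterns and subjects below are made up for this sketch.

```php
<?php
// Brackets must be escaped to be members of the class.
$brackets = '/[\[\]]/';

// A hyphen in last position is literal and needs no escape.
$hyphenLast = '/[a-]/';

// A caret that is not first is literal; a leading caret negates the class.
$literalCaret = '/[a^]/';
$negated      = '/[^a]/';

var_dump(preg_match($brackets, 'x]y'));   // the subject contains a bracket
var_dump(preg_match($hyphenLast, '-'));   // the hyphen itself is matched
var_dump(preg_match($literalCaret, '^')); // the caret itself is matched
var_dump(preg_match($negated, 'a'));      // nothing other than "a" to match
```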
Grouping and Sub-Patterns
Usually, you will not only want to match a pattern, but
extract data from it as well. To extract a specific part of
a pattern, you surround it within parentheses. For
example, to capture each part of the phone number
pattern, you would add parentheses as follows:
/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/
pass a third argument to preg_match. This argument
is set by the function as an array with the captured
sub-pattern results in it. The zeroth element of the
array is the text matched by the pattern as a whole,
while the sub-pattern captures are at the offset of
their pattern number. Patterns are numbered left-to-right
and outside-to-inside. So in the pattern above
the entire phone number is offset 0, the area code is
sub-pattern 1, the exchange is sub-pattern 2, and the
line number is sub-pattern 3.
Here you can see a sample phone number being run
through the regular expression.
$text = 'My phone number is 555-321-1212';
preg_match("/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/",
$text, $matches);
print_r($matches);
Executing that code yields the following results, just
as we predicted:
Array
(
[0] => 555-321-1212
[1] => 555
[2] => 321
[3] => 1212
)
We can also nest patterns. If we wanted to capture
the entire local part of the phone number, in addition
to its componentized parts, the regex could be modi-
fied to be:
/\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/
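Running the nested pattern against the earlier sample text shows how the numbering works out: the outer group takes the lower number, and its inner groups follow. This sketch reuses the sample sentence from the previous example.

```php
<?php
$text = 'My phone number is 555-321-1212';

preg_match('/\b([2-9]\d\d)-(([2-9]\d\d)-(\d\d\d\d))\b/', $text, $matches);

// Offsets: 0 = whole number, 1 = area code, 2 = entire local part,
// 3 = exchange, 4 = line number.
print_r($matches);
```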
passed the sub-pattern references as \1, but when we
double-quote a string, PHP attempts to interpret the
escaped characters for us. Single-quoting performs no
such interpretation and leaves your references
untouched. This is the same process by which "\n"
becomes a newline, but '\n' remains literally '\n'.
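The effect of the quoting style is easy to see in a small experiment. The pattern and subject here are invented for this sketch; the third call shows that doubling the backslashes is another way to keep the references intact inside double quotes.

```php
<?php
$pattern = '/(\d\d\d)-(\d\d\d\d)/';
$subject = '555-1212';

// Single quotes: \2 and \1 reach preg_replace() intact and act as references.
$single = preg_replace($pattern, '\2-\1', $subject);

// Double quotes: PHP turns "\2" and "\1" into the control characters
// chr(2) and chr(1) before preg_replace() ever sees them.
$double = preg_replace($pattern, "\2-\1", $subject);

// Double quotes with doubled backslashes behave like the single-quoted form.
$escaped = preg_replace($pattern, "\\2-\\1", $subject);

var_dump($single, $escaped);
var_dump($double === chr(2) . '-' . chr(1));
```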
We can reference sub-patterns in matches as well,
using the same rules. A fun example of this is finding
all 6-letter palindromes. A palindrome is a word that
is spelled the same forward and backward, for exam-
ple 'noon' or 'deed'. To spot a six-letter palindrome,
we match 3 characters and require that we see them
immediately in reverse order. Here is the pattern:
This isn't the full story on RFC compliant email
addresses. Because the specification allows for
addresses to contain descriptions as well, a com-
pletely accurate email address validator is actu-
ally quite complex. An example can be found at
the end of Mastering Regular Expressions in Perl
- the regex presented there is X characters long!
while(($line = fgets($fp)) !== false) {
    if(preg_match('/\b(\w)(\w)(\w)\3\2\1\b/', $line)) {
        print "palindrome: $line\n";
    }
}
Listing 1
/\b(\w)(\w)(\w)\3\2\1\b/
When we run this pattern against a palindrome like
'hallah', it matches as shown in Figure 5.
Notice that you need to use \b to make sure you
don't misidentify words that contain palindrome substrings.
If you are running on a UNIX system, Listing 1
is a code block that will find all the six-letter palindromes
in the dictionary file /usr/share/dict/words.
When we use preg_match_all with sub-patterns, we
have two choices of how we want the data returned to
us. The default behavior is for the match array to
contain an array for each sub-pattern, where that array
contains the capture for the nth search match as its nth
element. If that's confusing, here is how it looks when
matching all the phone numbers in a text:
<?php
$text = 'Work: 877-555-1212, Fax: 888-555-1212';
preg_match_all("/\b([2-9]\d\d)-([2-9]\d\d)-
. With this flag set, the ordering of the
match array is reversed: the match array contains one
element for each search text matched, with that array
containing the sub-pattern captures for that search
text. If we are looking to replicate the Perl idiom
while($text =~ /$regex/g) {
# perform work on one set of matches at a time
}
we can accomplish it with this PHP:
preg_match_all($regex, $text, $matches,
PREG_SET_ORDER);
foreach($matches as $match) {
// perform work on one set of matches at a time
}
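To make the two orderings concrete, the sketch below runs the same phone-number regex over the sample text twice, once per flag. The variable names $byPattern and $bySet are ours, chosen only for illustration.

```php
<?php
$text  = 'Work: 877-555-1212, Fax: 888-555-1212';
$regex = '/\b([2-9]\d\d)-([2-9]\d\d)-(\d\d\d\d)\b/';

// Default ordering: one array per sub-pattern.
preg_match_all($regex, $text, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[1] now holds every area code, $byPattern[3] every line number.

// Set ordering: one array per matched phone number.
preg_match_all($regex, $text, $bySet, PREG_SET_ORDER);
foreach ($bySet as $match) {
    // $match[0] is the full number, $match[1..3] are its captures.
    echo "area={$match[1]} exchange={$match[2]} line={$match[3]}\n";
}
```

Pattern ordering is convenient when we want a single column (all area codes, say); set ordering is convenient when each match is processed as a unit.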
Enumerations
Another important feature in pattern matching is the
ability to match variable-length patterns. In the phone
number example, even though the digits of the num-
ber were unknown, the length of the pattern was
fixed—it is always a three digit area code, three digit
exchange and four digit line number. On the other
hand, if we are matching email addresses, we don't a
priori know the length of the address.
To handle this, PCRE supplies enumeration modifiers.
The most basic description of an email address is a
number of non-whitespace characters, followed by an
'@', followed by more non-whitespace characters. \S is
the character class for all non-whitespace characters, so
Figure 6: Enumeration Modifiers
* Match 0 or more times.
+ Match 1 or more times.
? Match 0 or 1 times.
{m} Match exactly m times.
{m,n} Match between m and n times.
{m,} Match at least m times.
{0,n} Match between 0 and n times (write the 0 explicitly;
older PCRE releases read {,n} as literal text).
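A few quick checks show these modifiers at work. This is only a sketch: matches() is a hypothetical helper of our own, and the test strings are invented.

```php
<?php
// Hypothetical helper: true when $pattern matches $subject.
function matches($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/^\d*$/', ''));             // true:  * accepts zero digits
var_dump(matches('/^\d+$/', ''));             // false: + needs at least one
var_dump(matches('/^https?$/', 'http'));      // true:  ? makes the s optional
var_dump(matches('/^\d{4}$/', '1212'));       // true:  exactly four digits
var_dump(matches('/^\d{2,3}$/', '1212'));     // false: four digits is too many
var_dump(matches('/^\S+@\S+$/', 'george@example.com')); // true
```

The last line is the rough email description from the text: some non-whitespace, an '@', then more non-whitespace.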
According to RFC 2822, which defines the "official"
valid email address syntax, an email address is
composed of a localpart, an '@' and a domain. The localpart
is one or more characters from the set
[\w!#$%"*+\/=?`{}|~^-], while a domain is a dot-separated
list of parts composed of \w-. The pattern for the
local part is almost identical to the definition of \S+:
/[\w!#$%"*+\/=?`{}|~^-]+/
The pattern for domains is more complex. First, we
need to identify elements in the string. These are given
by
acters, alternations allow for matching a string against
multiple sub-patterns. For example, we might want to
identify all HTTP and FTP addresses in a document for
auto-linking or indexing purposes. We could do this
with two regular expressions:
#https?://\S+#
#ftp://\S+#
but this will require the document to be completely
scanned twice. Note that we are using # as a delimiter
and not /, since our pattern contains slashes and we
would rather not have to escape them. A more elegant
approach is to combine them using an alternation, as
follows:
#(https?|ftp)://\S+#
The alternation operator | means that the sub-pattern
#(https?|ftp)# matches either #https?# ('http'
with an optional 's') or #ftp#. To use this to
automatically create anchor tags for all linked content, we can
pattern is part of a single-quoted string.
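One way the alternation can be put to work for auto-linking is sketched below. The function name autolink() and the sample URLs are assumptions of ours, and \0 in the replacement stands for the entire matched address.

```php
<?php
// Hypothetical helper: wrap every HTTP, HTTPS or FTP address in an anchor tag.
function autolink($text)
{
    // Single quotes keep \0 intact for PCRE; # delimiters avoid escaping slashes.
    return preg_replace('#(https?|ftp)://\S+#', '<a href="\0">\0</a>', $text);
}

echo autolink('Docs at http://www.example.com/pcre and ftp://ftp.example.com/pub'), "\n";
```

Since \S+ stops only at whitespace, punctuation that directly follows an address ends up inside the link; a production version would want a stricter pattern.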
Positional Anchors
In the example of matching valid US phone numbers,
the regular expression we had was good for spotting
phone numbers in a block of text, but not for validat-
ing that a block of text is a phone number. To do that,
we need to ensure that the phone number is the only
element in the search text, with no leading or trailing
components. Anchors help solve this problem. To
mandate that our phone number match starts at the
beginning of the search text and ends at the end of it, we can
modify our regex as follows:
/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/
The leading ^ anchors the match at the beginning
of the text, meaning that the match will only succeed
if it begins there. The trailing $ anchors the match at
the end of the text, meaning that the match will only
succeed if the pattern terminates on the final charac-
[^]. Because an anchor is
not a character class (in fact it's a special zero-length
look behind assertion, but that's a topic for a later
article), it has no meaning inside a character class.
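The difference between spotting a phone number and validating one can be sketched by running an unanchored and an anchored pattern over the same inputs. The input strings and the variable names $find and $validate are our own.

```php
<?php
$find     = '/\b([2-9]\d{2})-([2-9]\d{2})-(\d{4})\b/'; // spots a number anywhere
$validate = '/^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$/';   // the number must be the whole text

$inputs = ['555-321-1212', 'Call 555-321-1212 now'];

foreach ($inputs as $input) {
    // preg_match returns 1 on a match and 0 otherwise.
    printf("%s: find=%d validate=%d\n",
        $input, preg_match($find, $input), preg_match($validate, $input));
}
```

Both patterns accept the bare number, but only the unanchored one accepts the sentence that merely contains it.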
Anchors are also useful for extracting information
near the beginning or end of a string. For example, a
line from an Apache Common Log Format logfile looks
like the following:
10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] "GET /~george/blog/ HTTP/1.1" 200 43489
This says that on February 13, 2004 a request for
"/~george/blog/" was made from the IP address
10.80.117.254. This request was successful (it returned
a 200 OK response code), and the amount of
data returned was 43489 bytes. Writing a full parser for
this log line is not too difficult (we will do so in the
cookbook section at the end of the article), but many
queries do not require parsing the entire log. For
instance, if we want to count the number of occurrences
of each response code, the expression to use is
quite simple. Looking at the log format, we see that the
last two fields are numbers, and we want the next to
last one. Expressed as a regex, that pattern looks like
this:
/(\d+) \d+$/
Working backwards, this says we first match the end
of the line ($), then a number (which we don't bother
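As a rough sketch of the response-code count described above, the end-anchored pattern can be applied to each log line in turn. The function name count_response_codes() is hypothetical, and the second and third log lines are invented sample data modeled on the first.

```php
<?php
// Hypothetical helper: tally the next-to-last field of each log line.
function count_response_codes(array $lines)
{
    $counts = [];
    foreach ($lines as $line) {
        // rtrim() drops a trailing newline so $ sits right after the byte count.
        if (preg_match('/(\d+) \d+$/', rtrim($line), $m)) {
            $code = $m[1];
            $counts[$code] = isset($counts[$code]) ? $counts[$code] + 1 : 1;
        }
    }
    return $counts;
}

$log = [
    '10.80.117.254 - - [13/Feb/2004:14:53:01 -0500] "GET /~george/blog/ HTTP/1.1" 200 43489',
    '10.80.117.254 - - [13/Feb/2004:14:53:05 -0500] "GET /missing.html HTTP/1.1" 404 312',
    '10.80.117.254 - - [13/Feb/2004:14:53:09 -0500] "GET /~george/ HTTP/1.1" 200 1024',
];

print_r(count_response_codes($log));
```

Lines whose final field is not a number are simply skipped, which is acceptable for a quick tally but would need handling in a full parser.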
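// Escape regex metacharacters in $domain, including the / delimiter
$domain = preg_quote($domain, '/');
// Concatenate: variables are not expanded inside single quotes
if(preg_match_all('/([\w!#\$%\"*+\/=?\'{}|~^-]+)@' . $domain . '/i',
$text, $matches, PREG_PATTERN_ORDER)) {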
return $matches[1];
}
return false;
}
Notice here that, in addition to using the i modifier,
we also use preg_quote to sanitize $domain. Data that
can potentially come from an untrusted source (such as
a user) should always be quoted to prevent the
accidental or malicious inclusion of regex characters. Also,
we use the PREG_PATTERN_ORDER flag so that all the
sub-pattern \1 matches are stored in $matches[1].
Otherwise we would need to iterate over $matches and
?>
Listing 2
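For readers who want to experiment, here is a self-contained sketch of a function in the same spirit: it collects the local parts of all addresses at a given domain. The name find_local_parts(), the character set taken from the local-part pattern earlier in the article, and the sample text are assumptions on our part, not the article's own listing.

```php
<?php
// Hypothetical helper: return the local parts of all addresses at $domain.
function find_local_parts($text, $domain)
{
    // Quote the domain so that '.' and friends are matched literally;
    // the second argument also escapes our / delimiter.
    $domain  = preg_quote($domain, '/');
    $pattern = '/([\w!#$%"*+\/=?`{}|~^-]+)@' . $domain . '/i';

    if (preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER)) {
        return $matches[1]; // every capture of sub-pattern 1
    }
    return false;
}

$text = 'Mail george@example.com or Sales@EXAMPLE.COM, not bob@exampleXcom.';
print_r(find_local_parts($text, 'example.com'));
```

Without preg_quote, the dot in the domain would match any character and the third address would be accepted as well.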
• m (treat as multiline). By default, PCRE
assumes that we intend our search text to be
processed as one big string, and ^ and $
will match only the beginning and ending of the
search text, respectively. When the m modifier
is used, ^ and $ will match at the beginning and
ending of every line in the search text (the
search text is considered to be
broken into lines by any new-
[.\s-]? # An optional delimiter - dot, dash or ws
(\d{4}) # Match the line number as subpattern 3
/x
More information on creating readable
patterns will be covered in a future article.
• A (Start anchored) This modifier is equivalent
to putting a ^ at the start of our pattern:
it anchors the pattern at the start of
the search text. Thus the following two
regular expressions are equivalent:
/^Subject: (.*)/
/Subject: (.*)/A
There is no benefit to using this method
over manually anchoring a pattern with ^
(other than, perhaps, moving the anchor
character from the beginning of your
pattern to its end).
• D (Dollar end-only) If this modifier is set, the
dollar end-anchor $ will match only at the
end of the string. By default,
(UTF-8) This modifier instructs PCRE to
treat patterns and search texts as UTF-8
characters instead of just single-byte charac-
ters. UTF-8 support is still new and should
be used with some caution as it may be
incomplete.
• e (Evaluated replacements). This causes the
replacement string in a preg_replace call to
be evaluated as PHP. Back-references are
expanded and the resulting expression is
executed via eval. The result of the
evaluation is used as the final replacement text.
Let's try an example of how to use this by
writing Wiki-style links to documents. In Wikis,
putting so-called CamelCaps text in a
document will link it to the wiki page of that
name. Doing this blindly with a regex can
be achieved with the following replacement:
$text = preg_replace('/\b(([A-Z]\w+){2,})\b/',
'<a href="/wiki/\1.html">\1</a>', $text);
This might result in a number of non-exis-
tent documents being linked to, though. If