
Solr 1.4 Enterprise
Search Server
Enhance your search with faceted navigation, result
highlighting, fuzzy queries, ranked scoring, and more
David Smiley
Eric Pugh

BIRMINGHAM - MUMBAI
This material is copyright and is licensed for the sole use by William Anderson on 26th August 2009
4310 E Conway Dr. NW, Atlanta, 30327
Solr 1.4 Enterprise Search Server
Copyright © 2009 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the authors, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: August 2009
Production Reference: 1120809
Published by Packt Publishing Ltd.
32 Lincoln Road
Olton
Birmingham, B27 6PA, UK.
ISBN 978-1-847195-88-3

Leena Purkait
Proofreader
Lynda Sliwoski
Production Coordinator
Shantanu Zagade
Cover Work
Shantanu Zagade
About the Authors
Born to code, David Smiley is a senior software developer and loves
programming. He has 10 years of experience in the defense industry at MITRE,
using Java and various web technologies. David is a strong believer in the
open source development model and has made small contributions to various
projects over the years.
David began using Lucene way back in 2000 during its infancy and was immediately
excited by it and its future potential. He later went on to use the Lucene-based
"Compass" library to construct a very basic search server, similar in spirit to Solr.
Since then, David has used Solr in a major search project and was able to contribute
modifications back to the Solr community. Although preferring open source
solutions, David has also been trained on the commercial Endeca search platform
and is currently using that product as well as Solr for different projects.
Most, if not all, authors seem to dedicate their book to someone. As
simply a reader of books, I have thought of this seeming prerequisite
as customary tradition. That was my feeling before I embarked on
writing about Solr, a project that has sapped my previously "free"

Thank you all.
Fascinated by the 'craft' of software development, Eric Pugh has been heavily
involved in the open source world as a developer, committer, and user for the
past five years. He is an emeritus member of the Apache Software Foundation
and lately has been mulling over how we move from the read/write Web to the
read/write/share Web.
In biotech, financial services, and defense IT, he has helped European and
American companies develop coherent strategies for embracing open source
software. As a speaker, he has advocated the advantages of Agile practices in
software development.
Eric became involved with Solr when he submitted the patch SOLR-284 for Parsing
Rich Document types such as PDF and MS Office formats that became the single
most popular patch as measured by votes! The patch was subsequently cleaned
up and enhanced by three other individuals, demonstrating the power of the
open source model to build great code collaboratively. SOLR-284 was eventually
refactored into Solr Cell as part of Solr version 1.4.
He blogs at
/>.
Throughout my life I have been helped by so many people, but all
too rarely do I get to explicitly thank them. This book is arguably
one of the high points of my career, and as I wrote it, I thought about
all the people who have provided encouragement, mentoring, and
the occasional push to succeed. First off, I would like to thank Erik
Hatcher, author, entrepreneur, and great family man for introducing

software engineer at IBM's Hursley Park laboratory—a role which taught him many
things, most importantly, his desire to work in a small company.
In January 2008, James founded WebMynd Corp., which received angel funding
from the Y Combinator fund, and he relocated to San Francisco. WebMynd is one
of the largest installations of Solr, indexing up to two million HTML documents
per day, and making heavy use of Solr's multicore features to enable a partially
active index.
Jerome Eteve holds a BSc in physics, maths and computing and an MSc in IT
and bioinformatics from the University of Lille (France). After starting his career in
the field of bioinformatics, where he worked as a biological data management and
analysis consultant, he's now a senior web developer with interests ranging from
database level issues to user experience online. He's passionate about open source
technologies, search engines, and web application architecture. He has been
working since 2006 for Careerjet Ltd, a worldwide job search engine.
Table of Contents
Preface
Chapter 1: Quick Starting Solr
An introduction to Solr
Lucene, the underlying engine
Solr, the Server-ization of Lucene
Comparison to database technology
Getting started
The last official release or fresh code from source control
Testing and building Solr
Solr's installation directory structure
Solr's home directory
How Solr finds its home

Sorting
Dynamic fields
Using copyField
Remaining schema.xml settings
Text analysis
Configuration
Experimenting with text analysis
Tokenization
WordDelimiterFilterFactory
Stemming
Synonyms
Index-time versus Query-time, and to expand or not
Stop words
Phonetic sounds-like analysis
Partial/Substring indexing
N-gramming costs
Miscellaneous analyzers
Summary
Chapter 3: Indexing Data
Communicating with Solr
Direct HTTP or a convenient client API
Data streamed remotely or from Solr's filesystem
Data formats
Using curl to interact with Solr
Remote streaming
Sending XML to Solr
Deleting documents
Commit, optimize, and rollback
Sending CSV to Solr
Configuration options

Matching all the documents
Mandatory, prohibited, and optional clauses
Boolean operators
Sub-expressions (aka sub-queries)
Limitations of prohibited clauses in sub-expressions
Field qualifier
Phrase queries and term proximity
Wildcard queries
Fuzzy queries
Range queries
Date math
Score boosting
Existence (and non-existence) queries
Escaping special characters
Filtering
Sorting
Request handlers
Scoring
Query-time and index-time boosting
Troubleshooting scoring
Summary
Chapter 5: Enhanced Searching
Function queries
An example: Scores influenced by a lookupcount

Alphabetic range bucketing (A-C, D-F, and so on)
Faceting dates
Date facet parameters
Faceting on arbitrary queries
Excluding filters
The solution: Local Params
Facet prefixing (term suggest)
Summary
Chapter 6: Search Components
About components
The highlighting component
A highlighting example
Highlighting configuration
Query elevation
Configuration
Spell checking
Schema configuration
Configuration in solrconfig.xml
Configuring spellcheckers (dictionaries)
Processing of the q parameter
Processing of the spellcheck.q parameter
Building the dictionary from its source
Issuing spellcheck requests
Example usage for a misspelled query

HTTP server request access logs
Solr application logging
Configuring logging output
Logging to Log4j
Jetty startup integration
Managing log levels at runtime
A SearchHandler per search interface
Solr cores
Configuring solr.xml
Managing cores
Why use multicore
JMX
Starting Solr with JMX
Take a walk on the wild side! Use JRuby to extract JMX information
Securing Solr
Limiting server access
Controlling JMX access
Securing index data
Controlling document access
Other things to look at
Summary
Chapter 8: Integrating Solr
Structure of included examples
Inventory of examples
SolrJ: Simple Java interface
Using Heritrix to download artist pages
Indexing HTML in Solr
SolrJ client API
Indexing POJOs
When should I use Embedded Solr

Chapter 9: Scaling Solr
Tuning complex systems
Using Amazon EC2 to practice tuning
Firing up Solr on Amazon EC2
Optimizing a single Solr server (Scale High)
JVM configuration
HTTP caching
Solr caching
Tuning caches
Schema design considerations
Indexing strategies
Disable unique document checking
Commit/optimize factors
Enhancing faceting performance
Using term vectors
Improving phrase search performance
The solution: Shingling
Moving to multiple Solr servers (Scale Wide)
Script versus Java replication
Starting multiple Solr servers
Configuring replication
Distributing searches across slaves
Indexing into the master server
Configuring slaves
Distributing search queries across slaves
Sharding indexes
Assigning documents to shards
Searching across shards
Combining replication and sharding (Scale Deep)
Summary

Preface
What this book covers
Chapter 1, Quick Starting Solr, introduces Solr to the reader as a middle ground
between database technology and document/web crawlers. The reader is guided
through the Solr distribution including running the sample configuration with
sample data.
Chapter 2, The Schema and Text Analysis, is all about Solr's schema. The schema
design is an important first order of business along with the related text
analysis configuration.
Chapter 3, Indexing Data, details several methods to import data; most of them can
be used to bring the MusicBrainz data set into the index. A popular Solr extension
called the DataImportHandler is demonstrated too.
Chapter 4, Basic Searching, is a thorough reference to Solr's query syntax from the
basics to range queries. Factors influencing Solr's scoring algorithm are explained
here, as well as diagnostic output essential to understanding how the query worked
and how a score is computed.
Chapter 5, Enhanced Searching, moves on to more querying topics. Various score
boosting methods are explained from those based on record-level data to those that
match particular fields or those that contain certain words. Next, faceting is a major
subject area of this chapter. Finally, the term auto-complete is demonstrated, which
is implemented by the faceting mechanism.
Chapter 6, Search Components, covers a variety of searching extras in the form of
Solr "components", namely, spell-check suggestions, highlighting search results,
computing statistics of numeric fields, editorial alterations to specific user queries,
and finding other records "more like this".
Chapter 7, Deployment transits from running Solr from a developer-centric perspective

<!-- <defaultSearchField>text</defaultSearchField>
<solrQueryParser defaultOperator="AND"/> -->
<copyField source="r_name" dest="r_name_sort" />
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are set in bold:
<arr name="id">
<str>mccm.pdf</str>
</arr>
Any command-line input or output is written as follows:
>> curl http://localhost:8983/solr/karaoke/update/ -H "Content-Type: text/xml" --data-binary '<commit waitFlush="false"/>'
New terms and important words are shown in bold. Words that you see on the
screen, in menus or dialog boxes for example, appear in the text like this: "Take for
example the Top Voters section".
Warnings or important notes appear in a box like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about
this book—what you liked or may have disliked. Reader feedback is important for us
to develop titles that you really get the most out of.
To send us general feedback, simply send an email to

, and
mention the book title via the subject of your message.

by selecting your title from
/>.

