Contents
Overview
Components of a SharePoint Portal Server Search
Adding Content Sources
Managing Content Sources
Lab A: Adding External Content to a Workspace
Review
Module 6: Adding and Managing External Content
Information in this document is subject to change without notice. The names of companies,
products, people, characters, and/or data mentioned herein are fictitious and are in no way intended
to represent any real individual, company, product, or event, unless otherwise noted. Complying
with all applicable copyright laws is the responsibility of the user. No part of this document may
be reproduced or transmitted in any form or by any means, electronic or mechanical, for any
purpose, without the express written permission of Microsoft Corporation. If, however, your only
means of access is electronic, permission to print one copy is hereby granted.
Materials and Preparation
This section provides the materials and preparation tasks that you need to teach
this module.
Required Materials
To teach this module, you need the Microsoft PowerPoint® file 2095a_6.ppt.
Preparation Tasks
To prepare for this module, you should:
Read all of the materials for this module.
Complete the lab.
Instructor Setup for a Lab
This section provides setup instructions that are required to prepare the
instructor computer or classroom configuration for a lab.
Lab A: Adding External Content to a Workspace
To prepare for the lab
• Configure the classroom according to the setup guide for course 2095a.
Customization Information
This section identifies the lab setup requirements for a module and the
configuration changes that occur on student computers during the labs. This
information is provided to assist you in replicating or customizing Training and
Certification courseware.
The lab in this module is also dependent on the classroom
configuration that is specified in the Customization Information section in the
Classroom Setup Guide for Course 2095A, Implementing Microsoft® SharePoint™
Portal Server 2001.
Lab Setup
The following list describes the setup requirements for the lab in this module.
Setup Requirement 1
The lab in this module requires no additional configuration. To prepare student
computers to meet this requirement, perform the following actions:
Configure the instructor computer according to the classroom setup guide
for course 2095a.
Configure the student computers according to the classroom setup guide of
course 2095a.
Lab Results
There are no configuration changes on student computers that affect replication
or customization.
service is a full-text indexing and search engine that is used to crawl, retrieve,
create and update indexes for this content. This module discusses this process
and examines the use of content sources for accessing content that is external to
the SharePoint Portal Server computer.
After completing this module, you will be able to:
Describe the components that are used in the searching and indexing
features of SharePoint Portal Server.
Define content source and describe the types of content that are supported,
how a content source is used, and how to add a content source.
Manage a content source by setting schedules, scope, and rules, and
describe additional functions that apply to content sources.
Topic Objective
To provide an overview of the module topics and objectives.
Lead-in
In this module, you will learn about adding and managing content with
SharePoint Portal Server.
MSSearch.
The Gatherer
Slide diagram labels: Accessing, Filtering, Indexing, Filter Daemon Process
Core Component of MSSearch
Manages How Content Is Accessed, Filtered, and Indexed
Includes Native and Registered Protocol Handlers
*****************************ILLEGAL FOR NON-TRAINER USE*****************************
component of SharePoint
Portal Server MSSearch.
Note
Using Protocol Handlers to Access Data Store Content
The Gatherer accesses documents in a data store by using the appropriate
protocol by way of a protocol handler interface. The protocol handler, which
is unrelated to network protocols, is an interface between the index and
SharePoint Portal Server. When the Gatherer processes a Uniform Resource
Locator (URL) during indexing, the filter daemon determines which protocol
handler to use based on the URL prefix, loads the associated dynamic-link
library (DLL), and passes the URL and security credentials to the protocol
handler.
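The prefix-based selection described above can be sketched in a few lines of Python. This is only an illustrative model, not the MSSearch implementation: the registry `HANDLERS`, the handler names in it, and the function `select_protocol_handler` are hypothetical, and the prefixes are limited to ones that appear in this module.

```python
from urllib.parse import urlsplit

# Hypothetical registry: URL prefix -> handler name.
# The real filter daemon loads a DLL for the prefix; these names are placeholders.
HANDLERS = {
    "http": "http-handler",
    "file": "file-handler",
    "exch": "exchange-handler",
}


def select_protocol_handler(url):
    """Return the (hypothetical) handler registered for the URL's prefix."""
    prefix = urlsplit(url).scheme.lower()
    if prefix not in HANDLERS:
        raise LookupError(f"no protocol handler registered for prefix {prefix!r}")
    return HANDLERS[prefix]
```

The point of the sketch is that the choice depends only on the text before the colon in the address, not on what the address points to.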
Native Protocol Handlers
SharePoint Portal Server includes native protocol handlers, or handlers that
ship with the product, for Hypertext Transfer Protocol (HTTP), file, Microsoft
Exchange 5.5, Microsoft Exchange 2000 Server, and Lotus Notes.
Exchange 2000 and SharePoint Portal Server share the Web Storage System
technology and the same protocol handler. This protocol handler accesses a
local Web Storage System by using Microsoft OLE DB Provider for
Exchange 2000 Server (EXOLEDB) and uses Web Distributed Authoring and
Versioning (WebDAV) to access the Web Storage System on a remote
Exchange or SharePoint Portal Server computer.
Registered Protocol Handlers
The following table lists the registered protocol handlers that are included with
SharePoint Portal Server.
Prefix DLL ProgID
Extract Content and Properties from Documents
Open Data Streams and Expose the Data as Indexable Chunks
SharePoint Portal Server Provides IFilters for:
TIFF (mspfilt.dll)
Null Filter (tquery.dll)
IFilters are the components of MSSearch that extract a document’s content and
its properties.
How IFilters Work
During the filter daemon process, IFilters open data streams and expose the data
so that it can be indexed. In particular, the Hypertext Markup Language
(HTML) filter strips a document of all HTML tags and emits various HTML
syntactic elements as properties, such as author or title, and also emits the body
text. Each file type, indicated by its file extension, has an IFilter associated with
it.
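To make the HTML filter's behavior concrete, the following is a minimal Python sketch built on the standard library's `html.parser`. It does not use the real IFilter interface; the class `HtmlFilterSketch` and the function `filter_html` are hypothetical names, and treating `<title>` and `<meta name=... content=...>` as the emitted properties is an assumption made for illustration.

```python
from html.parser import HTMLParser


class HtmlFilterSketch(HTMLParser):
    """Strips tags, collects selected elements as properties, and keeps body text."""

    def __init__(self):
        super().__init__()
        self.properties = {}   # e.g. {"title": ..., "author": ...}
        self.chunks = []       # visible text, in document order
        self._in_title = False
        self._skip = 0         # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            if attr.get("name") and attr.get("content") is not None:
                self.properties[attr["name"].lower()] = attr["content"]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.properties["title"] = text
        else:
            self.chunks.append(text)


def filter_html(document):
    """Return (properties, body_text) for an HTML string."""
    parser = HtmlFilterSketch()
    parser.feed(document)
    parser.close()
    return parser.properties, " ".join(parser.chunks)
```

For a page with a title, an author META tag, and two body elements, `filter_html` returns the title and author as properties and the body text with all tags removed.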
Null filter tquery.dll
Word Breakers and Noise Words
Word Breakers
Break words apart
Remove punctuation and symbols
Follow language-specific rules
Follow special case rules
Noise Words
Words that do not add value to a query (“and”, “the”)
MSSearch filters out noise words
In this topic, we will examine how word breakers and noise words are used to
facilitate indexing.
Using Word Breakers in Indexing
The content index uses the word breaker component in the following two
situations:
When an index is created or updated. The word breaker splits all text that is
referenced by the content index. The index is updated continuously as
documents are modified and closed.
At query time. A word breaker is used to break query strings into words and
phrases. For more information about word breaking at query time, see Module 7,
“Searching for Content,” in Course 2095A, Implementing Microsoft® SharePoint™
Portal Server 2001.
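The two uses of a word breaker can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the noise word list is a tiny illustrative sample, the function names `break_words` and `index_terms` are hypothetical, and the language-specific and special case rules that real word breakers follow are not modeled.

```python
import re

# Illustrative sample only; real noise word lists are language-specific.
NOISE_WORDS = {"a", "an", "and", "of", "the"}


def break_words(text):
    """Split text into lowercase words, dropping punctuation and symbols."""
    return [word.lower() for word in re.findall(r"[^\W_]+", text)]


def index_terms(text):
    """Break the text into words and filter out noise words."""
    return [word for word in break_words(text) if word not in NOISE_WORDS]
```

Running the same function over document text at index time and over the query string at query time is what lets the two sides agree on what counts as a word.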
Using SharePoint Portal Server and Operating System Word Breakers
The word breakers included in SharePoint Portal Server override existing
operating system word breakers. SharePoint Portal Server calls the operating
Active plug-in
Default Plug-ins
Auto-Categorization Module plug-in
PQS plug-in
Indexing plug-in
Gatherer plug-in
A plug-in is a component that resides in the Gatherer data pipeline and
processes the data that is emitted by the content filters. The Gatherer Project
uses plug-ins to process the text and properties of collected content.
Plug-in Categories
The Gatherer includes the following two categories of plug-ins:
Consumer plug-in. This plug-in uses only the text and properties that are
emitted and does not affect the pipeline.
Checks the schema to determine which properties to include in the index.
For properties that are retrievable in user search queries, it saves the data
in the property store. For properties and text that are marked for indexing, it
performs additional processing and stores the data in the full-text index.
Regulates the amount of data that is being passed to the full-text engine by
blocking the data pipeline when a threshold is reached.
Saves the data to the Jet property store. The data is saved in the property
store first and then the indexing engine saves the data. The property store is
located at %program files%\SharePoint Portal Server\data\ftdata\SharePoint
Portal Server.
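The schema check and the two storage targets described above can be sketched as follows. This is a hedged illustration in Python, not the product's data model: the `SCHEMA` dictionary, its `retrievable` and `indexed` flags, and the function `route_properties` are all hypothetical, and the threshold-based blocking of the pipeline is not modeled.

```python
# Hypothetical schema: which properties are retrievable and which are indexed.
SCHEMA = {
    "title": {"retrievable": True, "indexed": True},
    "author": {"retrievable": True, "indexed": False},
    "body": {"retrievable": False, "indexed": True},
}


def route_properties(doc_id, emitted, property_store, full_text_index):
    """Save retrievable properties first, then add indexed text to the index."""
    known = {name: value for name, value in emitted.items() if name in SCHEMA}

    # Step 1: the property store is written first.
    for name, value in known.items():
        if SCHEMA[name]["retrievable"]:
            property_store.setdefault(doc_id, {})[name] = value

    # Step 2: text marked for indexing goes to the full-text index.
    for name, value in known.items():
        if SCHEMA[name]["indexed"]:
            for word in value.lower().split():
                full_text_index.setdefault(word, set()).add(doc_id)
```

In this sketch a property that is not in the schema is simply ignored, which mirrors the idea that the schema decides what reaches the index.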
Gatherer Plug-In
The Gatherer plug-in can be thought of as the crawl manager. It receives the
call to start a crawl, checks for crawl restrictions, and maintains the crawl queue
and history. It is present in every Gatherer project, regardless of the
configuration.
Indexing Database
Index
The Indexing Database Provides a Consistent Structure for
Word lists
indexes, building on existing shadow indexes and word lists.
Because shadow indexes cannot be modified, the number of shadow indexes in
the content index will grow over time as new word lists are converted to
shadow indexes.
Topic Objective
To explain the function of an indexing database and its collection of four
indexes.
Lead-in
In this topic, we will examine how SharePoint Portal Server provides a
consistent structure for the components of the indexing database.
Master Index
Because the access time for a shadow index is almost constant regardless of
size, content index performance will decrease as more shadow indexes are
created. Therefore, it is advantageous to merge shadow indexes into a master
index. In SharePoint Portal Server, this process is called a master merge. By
default, it happens every night at midnight, after a specific number of
documents have been indexed, or if disk space gets too low. You cannot
manually initiate the creation of a master index. The master index, which is the
final repository for all indexing information, is by far the largest index. The
optimal content index is a master index, with no word lists or shadow indexes.
After a master merge, the content of the word lists and shadow indexes exists
only in the master index.
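The reason a master merge helps can be shown with a simplified Python model. It assumes each index is a plain dictionary that maps a word to the set of documents containing it; the real on-disk index format is not described in this module, and the function names `lookup` and `master_merge` are hypothetical.

```python
def lookup(word, indexes):
    """Query every index in turn; the work grows with the number of indexes."""
    hits = set()
    for index in indexes:
        hits |= index.get(word, set())
    return hits


def master_merge(shadow_indexes):
    """Combine shadow indexes into one master index without modifying them."""
    master = {}
    for shadow in shadow_indexes:
        for word, doc_ids in shadow.items():
            master.setdefault(word, set()).update(doc_ids)
    return master
```

Before the merge, a query must consult every shadow index. After the merge, the same query consults a single index and returns the same documents.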
outside the workspace, by means of content sources. SharePoint Portal Server
provides read access to, and searching within, content sources, but content
sources cannot be edited, checked in, or checked out. This section describes
some of the basic features of content sources and how to add them to your
Content Sources folder.
Topic Objective
To outline this topic.
Lead-in
In this section, you will learn about the basic procedure for adding a content
source.
Adding a Content Source
Types of Content Sources
When you add a content source to the Content Sources folder, you must provide
an address or URL for that content. The following table lists the types of
information that you can add to the workspace as a content source.
Source type                                  Sample address
Web site or Web page.                        http://www.microsoft.com/
File share.                                  file://server/share/page.htm
                                             -OR- \\server\share\folder
Exchange 5.5 public folder. The SharePoint   http://server/Public/Public Folders
Portal Server computer must be configured    -OR- exch://backofficestorage/
to crawl this type of folder.
Exchange 2000 public folder.                 http://server/Public/Public Folders/folder
Lotus Notes database. Before you can         Provide the name of the database and the
create this content source, the Lotus        address of the database server, such as:
Notes client must be properly installed on   //noteserver
the SharePoint Portal Server computer, and
the computer must be properly configured
with the NotesSetup utility.
Other SharePoint Portal Server workspaces.   http://server/workspace/folder/
Creating and Updating an Index of the Content
On a regular basis, SharePoint Portal Server creates and updates an index of the
source to your workspace, you must have read access to the source, know where
the content source files are stored, and know how the files will be searched.
Before you can add a content source to the workspace, the workspace
administrator must specify a default content access account.
If the administrator has not configured a default account for SharePoint Portal
Server to crawl, the wizard will prompt for one. This account will be used to
connect to the content source. SharePoint Portal Server also will allow you to
create indexes immediately, or you may choose to do so later.
To add a content source to your SharePoint Portal Server workspace:
1. Specify the location of the external content that you want to add to the
workspace.
You can add any one of five types of content sources using the Add Content Source Wizard.
You must choose content that is external to the current
workspace.
2. Open the Management folder, and then open the Content Sources folder.
3. Double-click Add Content Source.
4. The Add Content Source Wizard opens.
a. Define the content type by selecting the content source type that you
want to incorporate into the index.
b. Provide a path that directs SharePoint Portal Server to the linked content
by providing an address or URL for Web content or by providing the
database address and name for a Lotus Notes database.
The new content source is placed in the Content Sources folder. The
information available from the source is included in the workspace index and is
available for users to search for and view on the dashboard site.
For information about content access accounts, see Module 9, “Managing
SharePoint Portal Server,” in Course 2095A, Implementing Microsoft
Adding a Web content source for a Web server, network file share, or remote
SharePoint Portal Server workspace requires only a simple URL or Uniform Naming
Convention (UNC) file path.
To add a Web content source:
1. Run the Add Content Source Wizard.
2. Select Web Site, File Share, or SharePoint Portal Server as the content type.
3. Enter a valid URL or UNC path to the content, and specify the desired crawl
depth.
4. Assign a unique display name to the content source.
5. On the Finish page, you can choose to start the full build immediately, or
you can initiate it later.
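The information that the steps above collect can be modeled in a short Python sketch. It is not the wizard's logic: the functions `classify_address` and `add_content_source`, the dictionary layout, and the rule that a UNC path needs at least a server and a share are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def classify_address(address):
    """Return "unc" for \\\\server\\share paths and "url" for http/file URLs."""
    if address.startswith("\\\\"):
        parts = [part for part in address[2:].split("\\") if part]
        if len(parts) < 2:
            raise ValueError("a UNC path needs at least a server and a share")
        return "unc"
    if urlsplit(address).scheme in ("http", "file"):
        return "url"
    raise ValueError(f"not a recognized URL or UNC path: {address!r}")


def add_content_source(sources, display_name, address, crawl_depth):
    """Record a content source, enforcing a unique display name."""
    if display_name in sources:
        raise ValueError(f"display name already in use: {display_name!r}")
    sources[display_name] = {
        "address": address,
        "kind": classify_address(address),
        "crawl_depth": crawl_depth,
    }
    return sources[display_name]
```

Checking the address and the display name before anything is stored matches the order of the steps: a source is only added once both the path and the name are acceptable.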
For network file shares, you can specify any standard shared folder on a
Windows file system. MSSearch can also crawl mounted network file shares on
other operating systems that support the server message block (SMB) protocol,
for example IBM OS/2, Novell NetWare, and UNIX running an SMB service such as
Samba.
In Microsoft Site Server 3.0, users can map custom properties stored
in HTML META tags to Office properties using the text files schema.txt and
gathererprm.txt so that the metadata will be indexed. SharePoint Portal Server
version 1 does not support schema mapping using these files. Custom properties
in META tags will not be included in the index if they match properties in the
SharePoint Portal Server schema.
Topic Objective
To describe how to add a
and Integrated Windows authentication. When accessing file systems other than
Windows, such as UNIX or NetWare, you must use the Basic authentication
method.
Important
When crawling content in a non-trusted domain, you must use the Basic
authentication method, which you can set by using a site path rule. You also
cannot set a default content access account that resides in a non-trusted
domain.
Warning
Be careful when you set the crawl settings. If you configure a site to follow
all links, make sure that you are aware of the depth and size of the site. You
might use excessive bandwidth and not have enough disk space to crawl large
sites.
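The effect of the crawl depth setting can be seen in a small Python sketch that walks an in-memory link graph instead of a real site. It is an illustration only: the function `crawl`, the convention that `None` means "follow all links", and the sample site used in the usage note are hypothetical.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Breadth-first walk of a link graph, limited to max_depth link hops.

    links maps a page to the pages it links to. max_depth=None follows
    all links, which is the case the warning is about.
    """
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        page, depth = queue.popleft()
        visited.append(page)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

With a sample site in which the home page links to two pages and one of those links two levels deeper, a depth of 0 visits one page, a depth of 1 visits three, and following all links visits every reachable page. On a large site that last number is what drives bandwidth and disk usage.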
Adding an Exchange 5.5 Content Source
Required
The Outlook 2000 client must be installed
The Exchange server name
The Outlook Web Access server name
The Exchange site the server belongs to
The Exchange organization the server belongs to
server that is being indexed. You do not need to use Outlook Web Access,
but if you do not, SharePoint Portal Server requires additional configuration
to crawl the public folders.
The Exchange site that the server belongs to.
The Exchange organization that the server belongs to.
An access account with Administrator privilege on the Organization
container.
Enter the name of the site and the name of the organization exactly,
including the correct capitalization.
To crawl the public folders that reside on a different server, you must
replicate the folders to the crawled server. For information about replicating
public folders, see Module 10, “Examining an Enterprise-Level
Implementation,” in Course 2095A, Implementing Microsoft® SharePoint™ Portal
Server 2001.
Topic Objective
To describe how to add an Exchange 5.5 content source.
Lead-in
all public folders, the path must end with All Public Folders/ (note trailing slash
mark).
For Your Information
Setting up Site Server 3.0 Search to crawl Exchange 5.5 was very similar to
setting up SharePoint Portal Server to crawl an Exchange 5.5 content source.
However, Site Server required MSSearch to run in the context of the Exchange
Administrator account. With SharePoint Portal Server, the service runs as the
local system account and impersonates the Exchange account only when crawling
and performing security validations on search results.
Adding an Exchange 2000 Content Source
Index
Exchange
Public Folders
SharePoint Portal Server Indexes Any Items That
SharePoint Portal Server properties, just as with documents inside a
SharePoint Portal Server Web folder.
Attachments that the Gatherer usually filters. For example, an htm file is
included in the index. However, the search results for an attachment display
the subject and author of the message.
For more information about installing and accessing Outlook Web Access, see
the Exchange Server documentation.
Topic Objective
To describe how to add an Exchange 2000 content source.
Lead-in
In this topic, we will explore how to add an Exchange 2000 content source.
Note