SQL Server MVP Deep Dives- P7 - Pdf 76

196
C
HAPTER
13
Full-text searching
WHEN 0 THEN 'Newly created and not yet used'
WHEN 1 THEN 'Being used for insert'
WHEN 4 THEN 'Closed ready for query'
WHEN 6 THEN 'Being used for merge input and ready for query'
WHEN 8 THEN 'Marked for deletion. Will not be used for query and merge

source'
ELSE 'Unknown status code'
END
FROM sys.fulltext_index_fragments f
JOIN sys.tables t on f.table_id = t.object_id;
When this query returns, look for rows whose type is
4
, or
Closed ready for query
. A
table will be listed once for each fragment it has. If it turns out that you have a high
number of closed fragments, you should consider doing a
REORGANIZE
on the index
(using the
ALTER

FULLTEXT

INDEX

The
db_id()
function allows us to easily retrieve the database
ID
. We then use the
object_id
function to get the
ID
for the table name, passing in the text-based table
name. Table 6 shows a sampling of the results.
Table 6 Sample of results for query to find keywords
Keyword
Display
term
Column
Document
count
0x006C0069006700680074 light Description 7
0x006C006900670068007400650072 lighter Description 1
0x006C0069006700680074006500730074 lightest Description 1
0x006C0069006700680074007700650069006700680074 lightweight Description 11
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
197
Useful system queries
The Keyword column contains the Unicode version of the keyword in hexadecimal
format, and is used as a way to link the Display Term—the real indexed word—to
other views. The Column column is obvious; the Document Count indicates how
many times this keyword appears in the table.
One oddity about this particular

Keyword
Display
term
Column name Document ID
Occurrence
count
0x006C0069006700680074 light Description 249 1
0x006C0069006700680074 light Description 409 1
0x006C0069006700680074 light Description 457 1
0x006C0069006700680074 light Description 704 1
0x006C0069006700680074 light Description 1183 1
0x006C0069006700680074 light Description 1199 1
0x006C0069006700680074 light Description 1206 1
Table 6 Sample of results for query to find keywords (continued)
Keyword
Display
term
Column
Document
count
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
198
C
HAPTER
13
Full-text searching
Keyword and Display Term are the same as the previous view, as is the Column Name.
The Document
ID

0x006C0069006700680074 light 409 1 Alluminum-alloy frame provides a light,
stiff ride, whether you are racing in the
velodrome or on a demanding club ride
on country roads.
0x006C0069006700680074 light 457 1 This bike is ridden by race winners.
Developed with the AdventureWorks
Cycles professional race team, it has a
extremely light heat-treated aluminum
frame, and steering that allows preci-
sion control.
0x006C0069006700680074 light 704 1 A light yet stiff aluminum bar for long-
distance riding.
0x006C0069006700680074 light 1183 1 Affordable light for safe night riding;
uses 3 AAA batteries.
0x006C0069006700680074 light 1199 1 Light-weight, wind-resistant, packs to fit
into a pocket.
0x006C0069006700680074 light 1206 1 Simple and light-weight. Emergency
patches stored in handle.
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
199
Summary
Summary
This concludes our look at full-text searching with
SQL
Server 2008. We began by cre-
ating a catalog to hold our indexes, then proceeded to step two, creating the indexes
themselves. Our third step was querying the full-text indexes in a variety of ways.
Finally, we looked at some queries that will help us maintain and discover the state of
our full-text indexes.

for similar strings
Tom van Sti p h o u t
Are you a perfect speller? Is everyone in your company? How about your business
partners? Misspellings are a fact of life. There are also legitimate differences in
spelling: what Americans call rumors, the British call rumours. Steven A. Ballmer and
Steve Ballmer are two different but accurate forms of that man’s name. Your database
may contain a lot of legacy values from the days before better validation at the
point of data entry.
Overall, chances are your database already contains imperfect textual data,
which makes it hard to search. Additionally, the user may not know exactly what to
look for. When looking for a number or a date, we could search for a range, but
text is more unstructured, so database engines such as
SQL
Server include a range
of tools to find text, including the following:

EQUALS
(
=
) and
LIKE

SOUNDEX
and
DIFFERENCE

CONTAINS
and
FREETEXT


WHERE
clause, this
lookup is very fast and can’t be beat by any of the other techniques discussed later in
this chapter:
SELECT FirstName, LastName
FROM Person.Person
WHERE (LastName = 'Adams')
NOTE
Throughout this chapter, I’m using
SQL
Server 2008 and the sample
database AdventureWorks2008, available at http:
//www.codeplex.com/
SqlServerSamples.
LIKE
allows wildcards and patterns. This allows you to find data even if there’s only a
partial match. For example, this statement finds all contacts with a last name starting
with A:
SELECT FirstName, LastName
FROM Person.Person
WHERE (LastName LIKE 'A%')
The wildcards
%
and
_
are used as a placeholder for any text and any character. If you
omit wildcards altogether, the statement returns the same records as if
=
were used. If,
as in the preceding example, you use

are the built-in
functions to use. They only work for English pronunciation.
To get the
SOUNDEX
value, call the function:
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
202
C
HAPTER
14
Simil: an algorithm to look for similar strings
SELECT FirstName, LastName, SOUNDEX(LastName)
FROM Person.Person
WHERE (LastName LIKE 'A%')
SOUNDEX
returns a four-character string representing the sound of a given string. The
first character is the first letter of the string, and the remaining three are numbers
representing the sound of the first consonants of the string. Similar-sounding conso-
nants get the same value; for example the d in Adams gets a value of 3, just like a t
would. After all substitutions, Adams and Atoms have the same
SOUNDEX
value of A352.
One typical use for
SOUNDEX
is to store the values in a table, so that you can later run
fast-indexed lookups using
=
.
The

SELECT ProductDescriptionID, Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'bike')
You might think that’s equivalent to the following:
SELECT ProductDescriptionID, Description
FROM Production.ProductDescription
WHERE (Description LIKE '%bike%')
But the two statements aren’t equivalent. The former statement finds records with the
word bike, skipping those with bikes, biker, and other forms. Changing the latter state-
ment to
LIKE

'% bike

%'
doesn’t work either, if the word is next to punctuation.
The
CONTAINS
and
FREETEXT
keywords can also handle certain forms of fuzzy
matches, for example:
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
203
CONTAINS and FREETEXT
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'FORMSOF (INFLECTIONAL, ride) ')
This statement finds words that are inflectionally similar, such as verb conjugations

<pat>visaul basic</pat>
<sub>visual basic</sub>
</replacement>
If I were writing a resume-searching application, this could come in handy.
The last option I want to cover here is the
NEAR
keyword, which looks for two words
in close proximity to each other:
SELECT Description
FROM Production.ProductDescription
WHERE CONTAINS(Description, 'bike NEAR woman')
CONTAINS
and
FREETEXT
have two cousins—
CONTAINSTABLE
and
FREETEXTTABLE
. They
return
KEY
and
RANK
information, which can be used for ranking your results:
SELECT [key], [rank]
FROM CONTAINSTABLE(Production.ProductDescription, Description, 'bike')
ORDER BY [rank] DESC
So far we’ve covered the full range of text-searching features available in
T-SQL
, and

Other uses include educational software with open-ended questions. One tantaliz-
ing option the original authors mention is to combine Simil with a compiler, which
could then auto-correct common mistakes.
Let’s look at Simil in more detail, and learn how we can take advantage of it.
In 1988, Dr. Dobb’s Journal published the Ratcliff/Obershelp algorithm for pattern
recognition (Ratcliff and Metzener, “Pattern Matching: The Gestalt Approach,” http
://
www.ddj.com/184407970?pgno=5). This algorithm compares two strings and returns a
similarity between 0 (completely different) and 1 (identical). Ratcliff and Obershelp
wrote the original version in assembly language for the 8086 processor. In 1999, Steve
Grubb published his interpretation in the C language (http:
//web.archive.org/web/
Figure 1 A form showing
similar database records
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
205
Simil
20050213075957/www.gate.net/~ddata/utilities/simil.c). This is the version I used as
a starting point for the .
NET
implementation I’m presenting here.
The purpose of Simil is to calculate a similarity between two strings.
Algorithm
The Simil algorithm looks for the longest common substring, and then looks at the
right and left remainders for the longest common substrings, and so on recursively
until no more are found. It then returns the similarity as a value between 0 and 1,
by dividing the sum of the lengths of the substrings by the lengths of the strings
themselves.
Table 1 shows an example for two spellings of the word Pennsylvania. The algo-

(none) 0
Subtotal 16
Length of original strings 24
Simil = 16/24 0.67
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
206
C
HAPTER
14
Simil: an algorithm to look for similar strings

It’s insensitive for very small strings. For example, Adams and Ramos have three
out of five characters in common, so the Simil value is 0.6. Most people
wouldn’t call those names similar.

It treats every letter the same, without regard for vowels or consonants, or for
letters that often occur together, or for the location in the string, or any other
criteria. Some other algorithms do; for example, in the English language the
letters Q and U nearly always occur together and in that order, so much so that
they could almost be considered a single letter. In a more comprehensive algo-
rithm, such occurrences could be given special consideration.
SOUNDEX
is
another algorithm that does take into account that some consonants are almost
the same (for example, d and t).

Simil can’t be precalculated, always requires a table scan, and can’t take advan-
tage of indexes. This may be a problem for large datasets.
Implementation in .NET

method of the
RatcliffObershelp
class in SimilCLR.dll. This version is the one we’re using in the next section.
To me, it was gratifying to find out that the pure .
NET
version performed 30 per-
cent better than the straight port.
Installation
SimilCLR.dll is a .
NET
assembly. An assembly is a unit of execution of a .
NET
applica-
tion.
SQL
Server 2005 introduced the ability to run .
NET
assemblies in the
SQL
Server
process space. Running inside of the
SQL
Server process offers performance benefits
over the previous method of calling an extended stored procedure. If you’re using an
older version of
SQL
Server, I suggest using the classic
DLL
from your client or middle-
tier code. All code modules discussed here can be downloaded from the book’s down-

Simil
method in the assembly, and returns the Simil value for them:
CREATE FUNCTION dbo.fnSimil(@s1 nvarchar(max), @s2 nvarchar(max))
RETURNS float WITH EXECUTE AS CALLER
AS
EXTERNAL NAME asmSimil.[SimilCLR.RatcliffObershelp].Simil
In the next section, we’ll use this function to run the Simil algorithm.
Usage
The simplest use of this function, as shown in listing 1, is a procedure that takes a pair
of strings and returns the result through the output parameter.
CREATE PROCEDURE dbo.spSimil
@str1 nvarchar(max),
@str2 nvarchar(max),
@dblSimil float output
AS
SET NOCOUNT ON
SELECT @dblSimil = dbo.fnSimil(@str1, @str2)
RETURN
You can call this procedure like this:
DECLARE @dblSimil float
EXEC dbo.spSimil 'some string', 'some other string', @dblSimil OUTPUT
SELECT @dblSimil --0.786
A more powerful use of the function, shown in listing 2, is where you search an entire
table for similar strings, only returning those more similar than some threshold value.
This procedure returns all Person records where the Person’s name is more similar to
the given name than a certain threshold.
Listing 1 Calling the
fnSimil()
function from a stored procedure
Licensed to Kerri Ross <[email protected]>

pared with the results from the previous classic
DLL
based on Steve Grubb’s work, for
all CompanyNames in the Northwind database.
This test shown in listing 3 opens an
ADO
.
NET
data table and loops over each
record. It compares the Simil value from our .
NET
assembly with the previous classic
DLL
version (the “expected” value). The two values are compared in the
Expect
method and, if not equal, an exception is thrown and the test fails.
<Test()> _
Public Sub TestCompanyNames()
Dim dt As dsNorthwind.dtCustomersDataTable = m_Customers.GetData()
For Each r1 As dsNorthwind.dtCustomersRow In dt.Rows
For Each r2 As dsNorthwind.dtCustomersRow In dt.Rows
Dim similNew As Double =

SqlServerCLR.RatcliffObershelp.Simil(r1.CompanyName, r2.CompanyName)
Dim similClassic As Double =

similClassic(r1.CompanyName, r2.CompanyName)
Dim strMsg As String = "s1=" & r1.CompanyName & ", s2=" &

r2.CompanyName & ": simil new=" & similNew & ", expected=" & similClassic

sented a modern implementation of Simil as a .
NET
assembly. With a free download
and a few simple
T-SQL
scripts you can start using it today in your applications.
About the author
Tom van Stiphout is the software development manager of Kine-
tik I.T. (http:
//www.kinetik-it.com). Tom has a degree from
Amsterdam University and came to the United States in 1991.
After a few years with C++ and Windows
SDK
programming, he
gradually focused more on database programming. He worked
with Microsoft Access from version 1.0, and Microsoft
SQL
Server from version 4.5. During the last several years, Tom
added .
NET
programming to his repertoire.
Tom has been a frequent contributor to the online news-
groups for many years. He’s a former Microsoft Regional Direc-
tor and a current Microsoft Access
MVP
.

Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
210

) framework as
well as query and view services built over the
ADO
.
NET
provider model. The Entity
Framework has its own dialect of
SQL
(Entity
SQL
or
ESQL
) and can use
ESQL
statements or
LINQ
queries to access data. Although neither framework uses
vanilla
T-SQL
as its query language, both frameworks can generate
SQL
statements
or use existing stored procedures to access the database. This chapter is not an
introduction to these frameworks as I assume that you already know their basics,
but I will discuss how these frameworks interact with
SQL
Server, especially with
respect to performance.
One way to look at the performance of an abstraction layer is to examine and
profile the

EF
use when coding against their
companies’ databases. Most people who profess a dislike for the generated code
have never seen (or have seen very little of) the generated code. For someone who
writes and tunes
T-SQL
code, code generating programs and frameworks that rely
on code generation can be worrisome if the code generation compromises data-
base performance. Both Entity Framework and
LINQ
to
SQL
have
API
calls that can
expose their generated
T-SQL
; you can also use
SQL
Profiler to look at the
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
211
LINQ to SQL and performance
generated
T-SQL
. This chapter outlines some of the performance and manageability
concerns that arise through the use of these models, beginning with the dynamic gen-
eration of
SQL

SQL
statement produces multiple plans in the cache for what is the same query.
For example,
SELECT au_fname, au_lname FROM dbo.authors WHERE au_lname = 'Smith'
would produce a different query plan from this query:
SELECT au_fname, au_lname FROM dbo.authors WHERE au_lname = 'Jones'
In simple cases like this, the
SQL
Server query optimizer can perform what’s known as
auto-parameterization, in which case either of the queries above becomes
(@1 varchar(8000))SELECT [au_fname],[au_lname] FROM [dbo].[authors] WHERE
[au_lname]=@1
LINQ
to
SQL
and
EF
make every attempt to use parameterized
SQL
, rather than
dynamic
SQL
, in their code generation. Microsoft claims that
LINQ
to
SQL
minimizes
if not eradicates the potential for
SQL
injection.

opment Team Blog, begins with this entry: http:
//blogs.msdn.com/sqlprogrammability/archive/2007/01/
08/plan-cache-concepts-explained.aspx.
CONCERN
LINQ
to
SQL
and
EF
will proliferate the use of
SQL
code in applications, and will
almost surely produce suboptimal dynamic
SQL
, causing database performance
problems and plan cache pollution.
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.


Nhờ tải bản gốc
Music ♫

Copyright: Tài liệu đại học © DMCA.com Protection Status