SQL Server MVP Deep Dives- P8 - Pdf 76

236
C
HAPTER
17
Build your own index

Tester_sp
calls the tested procedure with four different search strings, and
records the number of rows returned and the execution time in milliseconds. The
procedure makes two calls for each search string, and before the first call for each
string,
tester_sp
also executes the command
DBCC

DROPCLEANBUFFERS
to flush the
buffer cache. Thus, we measure the execution time both when reading from disk and
when reading from memory.
Of the four search strings, two are three-letter strings that appear in 10 and 25
email addresses respectively. One is a five-letter string that appears in 1978 email
addresses, and the last string is a complete email address with a single occurrence.
Here is how we test the
plain_search
procedure. (You can also find this script in
the file 02_plain_search.sql.)
CREATE PROCEDURE plain_search @word varchar(50) AS
SELECT person_id, first_name, last_name, birth_date, email
FROM persons WHERE email LIKE '%' + @word + '%'
go
EXEC tester_sp 'plain_search'

SQL
Server can examine whether the second character in the column matches
the first letter of the search string, and move on if it doesn’t. But for
LIKE
,
SQL
Server
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
237
Fragments and persons
must examine every character at least once. On top of that, the collation in the test
database is a Windows collation, so
SQL
Server applies the complex rules of Unicode.
(The fact that the data type of the column is
varchar
does not matter.)
This has an important ramification when designing our search routines: we should
try to minimize the use of the
LIKE
operator.
Using a binary collation
One of the alternatives for improving the performance of the
LIKE
operator is to
force a binary collation as follows:
COLLATE Latin1_General_BIN2 LIKE '%' + @word + '%'
With a binary collation, the complex Unicode rules are replaced by a simple byte com-
parison. In the file 02_plain_search.sql, there is the procedure

Fragments and persons
We will now look at the first solution in which we build our own index to get good per-
formance with searches using
LIKE
, even on tens of millions of rows.
To achieve this, we first need to introduce a restriction for the user. We require his
search string to contain at least three contiguous characters. Next we extract all three-
letter sequences from the email addresses and store these fragments in a table together
with the
person_id
they belong to. When the user enters a search string, we split up
the search string into three-letter fragments as well, and look up which persons they
map to. This way, we should be able to find the matching email addresses quickly.
This is the strategy in a nutshell. We will now go on to implement it.
The fragments_persons table
The first thing we need is to create the table itself:
CREATE TABLE fragments_persons (
fragment char(3) NOT NULL,
person_id int NOT NULL,
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
238
C
HAPTER
17
Build your own index
CONSTRAINT pk_fragments_persons PRIMARY KEY (fragment, person_id)
)
You find the script for this table in the file 03_fragments_persons.sql. This script also
creates a second table that I will return to later. Ignore it for now.

SELECT w.frag, p.person_id
FROM persons p
CROSS APPLY wordfragments(p.email) AS w
This may not be optimal, though, as loading all rows in one go could cause the trans-
action log to grow excessively. The script 03_fragments_persons.sql includes the
stored procedure
load_fragments_persons
, which runs a loop to load the fragments
for 20,000 persons at a time. The demo database for this chapter is set to simple recov-
ery, so no further precautions are needed. For a production database in full recovery,
you would also have to arrange for log backups being taken while the procedure is
running to avoid the log growth.
If you have created the database, you may want to run the procedure now. On my
computer the procedure completes in 7–10 minutes.
Writing the search procedure
Although the principle for the table should be fairly easy to grasp, writing a search
procedure that uses it is not as trivial as it may seem. I went through some trial and
error, until I arrived at a good solution.
Before I go on, I should say that to keep things simple I ignore the possibility that
the search string may include wildcards like
%
or
_
, as well as range patterns like
[a-d]
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
239
Fragments and persons
or

string? To be able to determine which fragment this is, I introduced a second table as
follows:
CREATE TABLE fragments_statistics
(fragment char(3) NOT NULL,
cnt int NOT NULL,
CONSTRAINT pk_fragments_statistics PRIMARY KEY (fragment)
)
The script 03_fragments_persons.sql creates this table, and the stored procedure
load_fragments_persons
loads the table in a straightforward way:
INSERT fragments_statistics(fragment, cnt)
SELECT fragment, COUNT(*)
FROM fragments_persons
GROUP BY fragment
Not only do we have our own index, we now also have our own statistics!
Equipped with this table, I finally made progress, but I was still not satisfied with
the performance for the test string
[email protected]
. When data
was on disk, this search took over 4 seconds, which can be explained by the fact that
the least common fragment in this string maps to 2851 persons.
THE FINAL ANSWER
I did one final adjustment: look for persons that match both of the two least common
fragments in the search string. Listing 2 shows the procedure I finally arrived at.
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
240
C
HAPTER
17

rowno

<=

2
extracts the two least common fragments, and
with the help of
MIN
and
MAX
, we get them into variables. When we have the variables,
we run the actual search query.
You may think that a single
EXISTS
clause with a condition of
IN

(@frag1,
@frag2)
would suffice. I tried this, but I got a table scan in the fragments_persons
table, where there are two separate
EXISTS
clauses.
When I ran
map_search_five
through
tester_sp
, I got this result:
joy aam niska omamo@
Disk 373 260 4936 306

SQL
2008.
There are two versions because in the
SQL
2008 trigger I use the new
MERGE
statement.
The triggers are fairly straightforward, but there are a few things worth pointing
out. In listing 3, I show the version for
SQL
2008, as it is considerably shorter.
CREATE TRIGGER fragments_persons_tri ON persons
FOR INSERT, UPDATE, DELETE AS
SET XACT_ABORT ON
SET NOCOUNT ON
-- Exit directly if now row were affected.
IF NOT EXISTS (SELECT * FROM inserted) AND
NOT EXISTS (SELECT * FROM deleted)
RETURN
-- If this is an UPDATE, get out of email is not touched.
IF NOT UPDATE(email) AND EXISTS (SELECT * FROM inserted)
RETURN
DECLARE @changes TABLE
(fragment char(3) NOT NULL,
person_id int NOT NULL,
sign smallint NOT NULL CHECK (sign IN (-1, 1)),
PRIMARY KEY (fragment, person_id))
INSERT @changes (fragment, person_id, sign)
D
SELECT frag, person_id, SUM(sign)

242
C
HAPTER
17
Build your own index
USING (SELECT fragment, SUM(sign) AS cnt
FROM @changes
GROUP BY fragment
HAVING SUM(sign) <> 0) AS d ON fs.fragment = d.fragment
WHEN MATCHED AND fs.cnt + d.cnt > 0 THEN
UPDATE SET cnt = fs.cnt + d.cnt
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED BY TARGET THEN
INSERT (fragment, cnt) VALUES(d.fragment, d.cnt);
go
The trigger starts with two quick exits. At
B
we handle the case that the statement did
not affect any rows at all. In the case of an
UPDATE
operation, we don’t want the trigger
to run if the user updates some other column, and this is taken care of at
C
. Observe
that we cannot use a plain
IF

UPDATE
, as the trigger then would exit directly on any

F
we also update the fragments_statistics table. Because this is only a sta-
tistics table, this is not essential, but it’s a simple task—especially with
MERGE
in
SQL
2008. In
SQL
2005, this is one
INSERT
,
UPDATE
, and
DELETE
each.
To test the trigger you can use the script in the file 06_map_trigger.sql. The script
performs a few
INSERT
,
UPDATE
, and
DELETE
statements, mixed with some
SELECT
statements and invocations of
map_search_five
to check for correctness.
What is the overhead?
There is no such thing as free lunch. As you may expect, the fragments_persons table
incurs overhead. To start with, run these commands:

SQL
Server index. For a table that
holds persons, products, and similar base data, this overhead can still be acceptable, as
such tables are typically moderate in size and not updated frequently. But you should
think twice before you implement something like this on a busy transactional table.
Fragments and lists
The fragments_persons table takes up so much space because we store the same frag-
ment many times. Could we avoid this by storing a fragment only once? Yes. Consider
what we have in the following snippet:
fragment person_id
-------- ---------
aam 19673
aam 19707
aam 43131
aan 83500
aan 192379
If we only wanted to save space, we could just as well store this as follows:
fragment person_ids
-------- -----------------
aam 19673,19707,43131
aan 83500,192379
Most likely, the reader at this point gets a certain feeling of unease, and starts to ask all
sorts of questions in disbelief, such as

Doesn’t this violate first normal form?

How do we build these lists in the first place?

And how would we use them efficiently?


UDA
), a capability that was
added in
SQL
2005. You cannot write a
UDA
in
T-SQL
, but you must implement it in a
CLR
language such as
C#
. In
SQL
2005, a
UDA
cannot return more than 8,000 bytes, a
restriction that was removed in
SQL
2008. Thankfully, in practice this restriction is
insignificant, as we can work with the data in batches.
In the download archive you can find the files integerlist-2005.cs and integerlist-
2008.cs with the code for the
UDA
, as well as the compiled assemblies. The assemblies
were loaded by 01_build_database.sql, so all you need to do at this point is to define
the
UDA
as follows:
CREATE AGGREGATE integerlist(@int int) RETURNS varbinary(MAX)

WHERE n <= datalength(@str) / 4)
DISTINCT
is needed because there is no way to guarantee that these lists have unique
entries. As we shall see later, this is more than a theoretical possibility.
This is an inline table-valued function (
TVF
), and normally that is preferred over a
multi-statement function, because an inline
TVF
is expanded into the query, and the
optimizer can work with the expanded query. This is not the case with a multi-
statement
TVF
which also requires intermediate storage. I found when testing various
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
245
Fragments and lists
queries that the optimizer often went astray, and using a multi-statement function
gave me better performance. A multi-statement function also permitted me to
improve performance by using the
IGNORE_DUP_KEY
option in the definition of the
table variable's primary key and thereby remove the need for
DISTINCT
:
CREATE FUNCTION binlist_to_table_m2(@str varbinary(MAX))
RETURNS @t TABLE
(n int NOT NULL PRIMARY KEY WITH (IGNORE_DUP_KEY = ON)) AS
I have to admit that I am not a big fan of

for the email addresses con-
taining the fragment. The column listlen is used when maintaining the table. There
may not be much point to have it persisted, but nor is the cost likely to be high.
You find the definition of this table in the files 08_fragments_personlists-2008.sql
and 08_fragments_personlists-2005.sql. These files also include the preceding
CREATE
AGGREGATE
statement, and the load procedure for the table, which is what we will look
at next.
Loading the table
The conceptual query to load this table is simple:
INSERT fragments_personlists
(fragment, stored_person_list, no_of_entries)
SELECT w.frag, dbo.integerlist(w.person_id), COUNT(*)
FROM persons p
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
246
C
HAPTER
17
Build your own index
CROSS APPLY wordfragments(p.email) w
GROUP BY w.frag
Because of the size limitations imposed on
UDA:s
in
SQL
2005, this query will not run
on this version of

AND p.rowno < @batchstart + @batchsize
GROUP BY w.frag
)
MERGE fragments_personlists AS fp
USING personlists AS p ON fp.fragment = p.fragment
WHEN MATCHED THEN UPDATE
SET no_of_entries = fp.no_of_entries + p.cnt,
stored_person_list.write(p.person_list +
CASE WHEN fp.listlen < 7000 AND
fp.listlen < 4 *
(fp.no_of_entries + p.cnt)
THEN convert(varbinary(2000),
replicate(0x0, 4 *
(fp.no_of_entries + p.cnt)))
ELSE 0x
END,
4 * fp.no_of_entries, 4 * p.cnt)
WHEN NOT MATCHED BY TARGET THEN
Listing 4 Loading the fragments_personlists table
B
C
D
E
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
247
Fragments and lists
INSERT(fragment, no_of_entries, stored_person_list)
VALUES (p.fragment, p.cnt, p.person_list +
CASE WHEN p.cnt < 7000

email, and like any other nonclustered index, this index also includes the key of the
clustered index, which in the case of the persons table is the primary key, and thus the
index covers the query.
The next
CTE
,
personlists
at
D
, performs the aggregation from the batch. The
MERGE
statement then inserts new rows or updates existing ones in a fairly straightfor-
ward fashion, save for the business that goes on at
E
and
F
. This is the pre-allocation
scheme that I mentioned earlier. You can perform pre-allocation in many ways, and
choosing a scheme involves trade-offs for speed, fragmentation, and wasted space.
The scheme I’ve chosen is to allocate double the length I need now, but never allocate
more than 2000 bytes at a time. Note that when the length exceeds 7000 bytes I don’t
pre-allocate at all. This is because the fragmentation problem exists only as long as the
column is stored within the row. When the column is big enough to end up in large
object (
LOB
) storage space,
SQL
Server caters for pre-allocation itself.
Finally, at
G

2008 and for 15–17
minutes on
SQL
2005.
The system procedure
sp_spaceused
tells us that the table takes up 106
MB
, or 27
percent of the space of the fragments_persons table.
F
G
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
248
C
HAPTER
17
Build your own index
A search procedure
In the preceding section, we’ve been able to save space, but will we also be able to
write a stored procedure with the same performance we got using the fragments_
persons table?
The answer is yes, but it’s not entirely straightforward. I started with the pattern in
map_search_five
, but I found that in the query that determines the two least com-
mon fragments,
SQL
Server was scanning the fragments_personlists table. To work
around this, I saved the output from the

binlist_to_table
function. When I used the inline version, it took a minute to run
the procedure for the string niska!
The results for
list_search_four
with our test words follow:
joy aam niska omamo@
Disk 203 266 6403 473
Cache 16 0 500 46
Compared to the results for
map_search_five
, the performance is better in some
cases, but worse in others.
Listing 5 Search procedure using fragments_personlists
Licensed to Kerri Ross <[email protected]>
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
249
Fragments and lists
The file 09_list_search.sql contains the code for
list_search_four
, as well as five
other
list_search
procedures. The first three illustrate my initial attempts, and they
do not perform well. The last two are variations with more or less the same perfor-
mance as
list_search_four
.
Keeping the lists updated
As in the case with

As a consequence the person_list column for a fragment could include duplicate
entries of the same
person_id
. A simple example is when a user mistakenly changes
the email address of a person, and then restores the original address—hence, the
need for
DISTINCT
in the
binlist_to_table
function.
You can find the code for the trigger in the files 10_fragments_personlists_trigger-
2005.sql and 10_fragment_personlists_trigger-2008.sql. In the file 11_list_trigger_
test.sql there is a script for testing the trigger. I’m not including the trigger code here
in full, as it’s similar to the load procedure. The trigger for
SQL
2008 does not resort
to batching, but in the trigger for
SQL
2005 batching is unavoidable, due to the size
restriction with the
UDA
. One thing is a little different from the load procedure,
though: in case of
UPDATE
s we should not store
fragment-person_id
mappings that
do not change. Listing 6 shows how this looks in the trigger for
SQL
2005.

SQL
2005, comes in handy when dealing with this
issue. Also, observe that here the batching is done differently from the load proce-
dure. In the load procedure we numbered the rows by email for better performance,
but if we were to try this in our trigger, things could go wrong. Say that the email
address for person 123 is changed from
[email protected]
to
[email protected]
in a mass
update of more than 2,000 rows. If we number rows by email, the rows for person 123
in inserted and deleted would be in different batches, and so would the rows for at least
one more person. By batching on the primary key, we avoid this.
You can use the procedure
volume_update_sp
from 07_trigger_volume_test.sql to
measure the overhead of the trigger. I got these numbers:
SQL 2005 SQL 2008
INSERT took 23570 ms. INSERT took 11463 ms.
UPDATE took 21490 ms. UPDATE took 9093 ms.
DELETE took 610 ms. DELETE took 670 ms.
Thus on
SQL
2008, there is a considerable reduction in the overhead compared to the
trigger for the fragments_persons table. To be fair, that trigger handles deletions as
well.
Using bitmasks
The last technique we will look at uses an entirely different approach. This is not my
own invention; Sylvain Bouche developed it and was kind to share his idea with me.
In contrast to the other two techniques that rely heavily on features added in

operator on this restricted set. That is, this condi-
tion must be true:
email_bitmask & char_bitmask(@wild) = char_bitmask(@wild)
This condition cannot result in a seek of the index on
email_bitmask
, but is only
good for a scan. From the preceding equation, this condition follows:
email_bitmask >= char_bitmask(@wild)
The bitmask value for the column must be at least equal to the bitmask for the search
string. Thus, we can constrain the search to the upper part of the index. This leads to
the procedure shown in listing 7.
CREATE PROCEDURE bit_search_two @wild varchar(50) AS
SET NOCOUNT ON
DECLARE @bitmask bigint
SELECT @bitmask = dbo.char_bitmask(@wild)
SELECT person_id, first_name, last_name, birth_date, email
FROM persons
WHERE email_bitmask >= @bitmask
AND CASE WHEN email_bitmask & @bitmask = @bitmask
THEN patindex('%' + @wild + '%', email)
ELSE 0
END > 0
The sole purpose of the
CASE
statement is to make absolutely sure that
SQL
Server
evaluates only the
patindex
function for rows with matching bitmasks.

Nhờ tải bản gốc

Tài liệu, ebook tham khảo khác

SQL Server MVP Deep Dives- P8 - Pdf 76

Tài liệu, ebook tham khảo khác

Học thêm