$NODE[] = array("localhost", 3310, "/var/run/mysqld/mysqld4.sock");
function getShardAndNodeFromUserId($userId, $common)
{
global $NODE;
1 $shardNo = shardNumber($userId);
2 $row = $NODE[$shardNo % count($NODE)];
$db_server = $row[0] == "localhost" ? ":{$row[2]}" : "{$row[0]}:{$row[1]}";
$conn = mysql_connect($db_server, 'query_user');
3 mysql_select_db("shard_$shardNo", $conn);
return array($shardNo, $conn);
}
function getShardAndNodeFromArticleId($articleId, $common) {
$query = "SELECT user_id FROM article_author WHERE article_id = %d";
mysql_select_db("common");
$result = mysql_query(sprintf($query, $articleId), $link);
$row = mysql_fetch_row($result);
return getShardAndNodeFromUserId($row[0], $common);
}
Updating or reading a shard
After
you have
identified the shard number and the node, it is time to create the
functions to retrieve information from the shards. Example 5-10 defines two such
functions:
getArticlesForUser
This function accepts a user ID and returns an array of all articles the user has
written. The partition function ensures that all articles are on the same shard, so
the function in line 1 computes the shard number shared by all the articles. The
node for the shard is then fetched in line 2. After that, the correct database name
for the shard is computed (line 3) and a single query is sent to the node to retrieve
all the articles in the shard.
FROM articles
WHERE article_id = $articleId
END_OF_SQL;
$QUERIES[] = $article_query;
$result = mysql_query($article_query, $node);
$article = mysql_fetch_object($result);
# Fetch the comments from the same shard
$comment_query = <<<END_OF_SQL
SELECT author_name, body, published
FROM comments
WHERE article_ref = $articleId
END_OF_SQL;
$result = mysql_query($comment_query, $node);
$comments = array();
while ($obj = mysql_fetch_object($result))
$comments[] = $obj;
return array($article, $comments);
}
In
this example,
we are reading from the shards directly, but if we are scaling out reads
as well, read queries should be directed to the slaves instead. Implementing this is
straightforward.
Implementing a dynamic sharding scheme
The disadvantage of the approach discussed so far is that the partition function is static,
meaning that if certain nodes get a lot of traffic, it is not straightforward to move a shard
from one node to another, since it requires a change to the application code.
An example can be found in the simple blogging application we have used so far. If a
user attracts a lot of attention because she suddenly posts some very interesting articles,
her shard will become very “hot.” This will cause an imbalance between the shards,
user_id INT UNSIGNED,
PRIMARY KEY (article_id)
);
To
find the
node location of the shard, you must change the PHP function that sends
a query so it extracts the shard location from the shard_to_node table. The necessary
changes are shown in Example 5-12. Notice that the array of nodes has disappeared
and been replaced by a query to the shard_to_node table in the common database and
that the function to compute the shard number now queries the user table to get the
shard for the user.
Example 5-12. Changes to use the new dynamic sharding scheme
function shardNumber($userId, $common)
{
$result = mysql_query("SELECT shard FROM user WHERE user_id = $userId", $common);
$row = mysql_fetch_row($result);
return $row[0];
}
Data Sharding | 179
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
function getShardAndNodeFromUserId($userId, $common)
{
$shardNo = shardNumber($userId);
$query = "SELECT host,port,sock FROM shard_to_node WHERE shard = %d";
mysql_select_db("common", $common);
$result = mysql_query(sprintf($query, $shardNo), $common);
$row = mysql_fetch_row($result);
$db_server = $row[0] == "localhost" ? ":{$row[2]}" : "{$row[0]}:{$row[1]}";
$conn = mysql_connect($db_server, 'query_user');
mysql_select_db("shard_$shardNo", $conn);
updates to the shard. You can do this either by locking the shard in the application or
by locking tables in the database.
180 | Chapter 5: MySQL Replication for Scale-Out
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Implementing locking in the application requires coordination of all requests so that
there are no known conflicts, and since web applications are inherently distributed,
lock management can become quite complicated very quickly.
In our case, we simplify the situation by locking a single table—the shard_to_node table
—instead of spreading out the locks among the various tables accessed by many clients.
Basically, all lookups for shard locations go through the shard_to_node table, so a single
lock on this table ensures that no new updates to any shard will be started while we
perform the move and remap the shards. It is possible that there are updates in progress
that either have started to update the shard or are just about to start updating the shard.
By locking the shard, any updates in progress will be allowed to finish and any updates
that are about to start just wait for us to release the lock. When the lock on the shard
is released, the shard will be gone, so the statements doing the update will fail and will
have to be redone on the new shard.
You can use the Replicant library to automate this procedure (shown in Example 5-13).
Example 5-13. Procedure for moving a shard between nodes
_UPDATE_SHARD_MAP = """
UPDATE shard_to_node
SET host = %s, port = %d, sock = %s
WHERE shard = %d
"""
_LOCK_SHARD_MAP = """
BEGIN;
SELECT host, port, sock
FROM shard_to_node
WHERE shard = %d FOR UPDATE
"""
tion but are not yet connected to it, or alternatively may have started updating the shard.
The application code has to take this into account. The easiest solution is to have the
application recompute the node if the query to the shard fails. You can assume that a
failure means the shard was recently moved and that it has to be looked up again.
Example 5-14 shows the changes that are necessary to fix the getArticlesForUser
function.
Example 5-14. Changes to application code to handle shard moving
function getArticlesForUser($userId, $common)
{
global $QUERIES;
$query = <<<END_OF_SQL
SELECT author_id, article_id, title, published, body
FROM articles
WHERE author_id = %d
END_OF_SQL;
do {
list($shard, $node) = getShardAndNodeFromUserId($userId, $common);
$articles = array();
$QUERIES[] = sprintf($query, $userId);
$result = mysql_query(sprintf($query, $userId), $node);
} while (!$result && mysql_errno($node) == 1146);
while ($obj = mysql_fetch_object($result))
$articles[] = $obj;
return $articles;
}
Occasionally, as we saw in the previous section where a user suddenly became popular,
it is necessary to move individual items of data between shards as well.
Moving a user is more complicated than moving a shard, because it requires extracting
a user and all his associated articles and comments from a shard and reinstalling them
in another shard. The technique is highly application-dependent, so the ideas we offer
common> COMMIT;
4. Delete the user’s articles and comments from the old shard.
shard> DELETE FROM shard_old.comments
-> USING shard_old.articles, shard_old.comments
-> WHERE article_ref = articles_id AND author_id = UserID;
shard> DELETE FROM shard_old.articles WHERE author_id = UserID;
shard> COMMIT;
In
this case,
it is necessary to keep two connections open: one for the node containing
the common database and one for the node containing the shards. If the shards and
the common database are on the same node, the problem is significantly simplified,
but we cannot assume that.
If the shards are on different databases, the following procedure will solve the problem
in a relatively straightforward way.
1. Create a backup of the articles and comments on the source node and, at the same
time, get a binlog position corresponding to the backup.
To do this, lock the rows for the user in both the articles and comments tables. Note
that to do this, it is necessary to start a transaction similar to the one in which we
updated the shard_to_node table when moving a shard, but here it is sufficient to
block writes, not reads.
Data Sharding | 183
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
shard_old> BEGIN;
shard_old> SELECT * FROM articles, comments
-> WHERE article_ref = article_id AND author_id = UserID
-> FOR UPDATE;
2.
Create a backup of the articles and comments.
shard_old
clearly needs to be avoided or you will risk a bunch of irritated customers.
To avoid getting data that is too old, it is necessary to somehow ensure that the data
provided by the slave is recent enough to be useful. As you will see, the problem be-
comes even trickier when a relay server is added to the mix. The basic idea of handling
this is to somehow mark each transaction committed on the master, and then wait for
the slave to reach that transaction (or later) before trying to execute a query on the slave.
184 | Chapter 5: MySQL Replication for Scale-Out
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
The problem needs to be handled in different ways depending on whether there are
any relay slaves between the master and the slave.
Consistency in a Nonhierarchal Deployment
When all the slaves are connected directly to the master, it is very easy to check for
consistency. In this case, it is sufficient to record the binlog position after the transaction
has been committed and then wait for the slave to reach this position using the previ-
ously introduced MASTER_POS_WAIT function. It is, however, not possible to get the exact
position where a transaction was written in the binlog. Why? Because in the time be-
tween the commit of a transaction and the execution of SHOW MASTER STATUS, several
events can be written to the binlog.
This does not matter, since in this case it is not necessary to get the exact binlog position
where the transaction was written; it is sufficient to get a position that is at or later
than the position of the transaction. Since the SHOW MASTER STATUS command will show
the position where replication is currently writing events, executing this after the trans-
action has committed will be sufficient for getting a binlog position that can be used
for checking consistency.
Example 5-15 shows the PHP code for processing an update to guarantee that the data
presented is not stale.
Example 5-15. PHP code for avoiding read of stale data
function fetch_master_pos($server) {
$result = $server->query('SHOW MASTER STATUS');
if ($result == NULL)
function commit_and_sync($master, $slave) {
if ($master->commit()) {
if (!sync_with_master($master, $slave))
return NULL; // Synchronization failed
return TRUE; // Commit and sync succeeded
}
return FALSE; // Commit failed (no sync done)
}
function start_trans($server) {
$server->autocommit(FALSE);
}
In Example
5-15,
you see the functions commit_and_sync and start_trans together with
the three support functions, fetch_master_pos, wait_for_pos, and sync_with_master.
The commit_and_sync function commits a transaction and waits for it to reach a desig-
nated slave. It accepts two arguments, a connection object to a master and a connection
object to the slave. The function will return TRUE if the commit and the sync succeeded,
FALSE if the commit failed, and NULL if the commit succeeded but the synchronization
failed (either because there was an error in the slave or because the slave lost the master).
The function works by committing the current transaction and then, if that succeeds,
fetching the current master binlog position through SHOW MASTER STATUS. Since other
threads may have executed updates to the database between the commit and the call
to SHOW MASTER STATUS, it is possible (even likely) that the position returned is not at
the end of the transaction, but rather somewhere after where the transaction was writ-
ten in the binlog. As mentioned earlier, this does not matter from an accuracy per-
spective, since the transaction will have been executed anyway when we reach this later
position.
After fetching the binlog position from the master, the function proceeds by connecting
to the slave and executing a wait for the master position using the MASTER_POS_WAIT
transaction.
The second solution, illustrated in Figure 5-11, connects to all the relay servers in the
path from the master to the final slave to ensure the change propagates to the slave. It
is necessary to connect to each relay slave between the master and the slave, since it is
not possible to know which binlog position will be used on each of the relay servers.
Consistency in a Hierarchal Deployment | 187
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Figure 5-11. Synchronizing with all servers in a relay chain
Both
solutions
have
their merits, so let’s consider the advantages and disadvantages of
each of them.
If the slaves are normally up-to-date with respect to the master, the first solution will
perform a simple check of the final slave only and will usually show that the transaction
has been replicated to the slave and that processing can proceed. If the transaction has
not been processed yet, it is likely that it will be processed before the next check, so the
second time the final slave is checked, it will show that the transaction has reached the
slave. If the checking period is small enough, the delay will not be noticeable for
the user, so a typical consistency check will require one or two extra messages when
polling the final slave. This approach requires only the final slave to be polled, not any
of the intermediate slaves. This can be an advantage from an administrative point as
well, since it does not require keeping track of the intermediate slaves and how they
are connected.
On the other hand, if the slaves normally lag behind, or if the replication lag varies a
lot, the second approach is probably better. The first solution will repeatedly poll the
slave, and most of the time will report that the transaction has not been committed on
the slave. You can handle this by increasing the polling period, but if the polling period
has to be so large that the response time is unacceptable, the first solution will not work
188 | Chapter 5: MySQL Replication for Scale-Out
peatedly to see whether the transaction has been executed. This code uses the
Last_Exec_Trans table introduced in Chapter 4 by checking it on the master, and then
repeatedly reading the table on the slave until it finds the correct transaction.
Example 5-17. PHP code for avoiding read of stale data using polling
function fetch_trans_id($server) {
$result = $server->query('SELECT server_id, trans_id FROM Last_Exec_Trans');
if ($result == NULL)
return NULL; // Execution failed
$row = $result->fetch_assoc();
if ($row == NULL)
return NULL; // Empty table !?
$gid = array($row['server_id'], $row['trans_id']);
$result->close();
return $gid;
Consistency in a Hierarchal Deployment | 189
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
}
function wait_for_trans_id($server, $server_id, $trans_id) {
if ($server_id == NULL || $trans_id == NULL)
return TRUE; // No transactions executed, trivially in sync
$server->autocommit(TRUE);
$gid = fetch_trans_id($server);
if ($gid == NULL)
return FALSE;
list($current_server_id, $current_trans_id) = $gid;
while ($current_server_id != $server_id || $current_trans_id < $trans_id) {
usleep(500000); // Wait half a second
$gid = fetch_trans_id($server);
if ($gid == NULL)
return FALSE;
ID and check it once before entering the loop.
• This code requires access only to the master and slave, not to the intermediate relay
servers.
190 | Chapter 5: MySQL Replication for Scale-Out
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Example 5-18 includes code for ensuring you do not read stale data. It uses the tech-
nique of querying all servers between the master and the final slave. This method pro-
ceeds by first finding the entire chain of servers between the final slave and the master,
and then synchronizing each in turn all the way down the chain until the transaction
reaches the final slave. The code reuses the fetch_master_pos and wait_for_pos from
Example 5-13, so they are not repeated here. The code does not implement any caching
layer.
Example 5-18. PHP code for avoiding reading stale data using waiting
function fetch_relay_chain($master, $final) {
$servers = array();
$server = $final;
while ($server !== $master) {
$server = get_master_for($server);
$servers[] = $server;
}
$servers[] = $master;
return $servers;
}
function commit_and_sync($master, $slave) {
if ($master->commit()) {
$server = fetch_relay_chain($master, $slave);
for ($i = sizeof($server) - 1; $i > 1 ; --$i) {
if (!sync_with_master($server[$i], $server[$i-1]))
return NULL; // Synchronization failed
}
will have committed many transactions that the master is not aware of. In a sense,
the master is in an alternative future compared to the rest of the system. An illus-
tration of this situation is shown in Figure 5-12.
• If the master committed a transaction and wrote it to the binary log, then crashed
just after it acknowledged the transaction, the transaction may not have made it
to the slaves. This means the master has one or more transactions that have not
been seen by the slaves, nor by any other part of the system.
If the original master is not too far behind the current master, the easiest solution to
the first problem is to connect the original master as a slave to the current master, and
then switch over all slaves to the master once it has caught up. If, however, the original
master has been offline for a significant period, it is likely to be faster to clone one of
the slaves and then switch over all the slaves to the master.
If the master is in an alternative future, it is not likely that its extra transactions should
be brought into the deployment. Why? Because the sudden appearance of a new trans-
action is likely to conflict with existing transactions in subtle ways. For example, if the
transaction is a message in a message board, it is likely that a user has already recom-
mitted the message. If a message written earlier but reported as missing—because the
master crashed before the message was sent to a slave—suddenly reappears, it will
befuddle the users and definitely be considered an annoyance. In a similar manner,
users will not look kindly on shopping carts suddenly having items added because the
master was brought back into the system.
In short, you can solve both of the out-of-sync problems—the master in an alternative
future and the master that needs to catch up—by simply cloning a slave to the original
master and then switching over each of the current slaves in turn to the original master.
192 | Chapter 5: MySQL Replication for Scale-Out
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
These problems, however, highlight how important it is to ensure consistency by
checking that changes
to a master are available on some other system before reporting
the transaction as complete, in the event that the master should crash. The code that
A knock on his door drew Joel’s attention away from reading his email. He wasn’t
surprised to see Mr. Summerson standing in his doorway.
“Yes, sir?”
“I am getting a little concerned about all this replication stuff we’ve got now. I’d like
you to do some research into what we need to do to improve our knowledge of how it
all works. I want you to put together a document explaining not only the current con-
figuration, but also troubleshooting ideas with specific details on what to do when
things go wrong and what makes it tick.”
Joel was expecting such a task. He, too, was starting to be concerned that he needed
to know more about replication. “I’ll get right on it, sir.”
“Great. Take your time on this one. I want to get it right.”
Joel nodded as his boss walked away. He sighed and gathered his favorite MySQL books
together. He needed to do some reading on the finer points of replication.
Previous chapters introduced the basics of configuring and deploying replication to
keep your site
up and available, but to understand replication’s potential pitfalls and
how to use it effectively, you should know something about its operation and the kinds
of information it uses to accomplish its tasks. This is the goal of this chapter. We will
cover a lot of ground, including:
• How to promote slaves to masters more robustly
• Tips for avoiding corrupted databases after a crash
• Multisource replication
• Row-based replication
195
Please purchase PDF Split-Merge on www.verypdf.com to remove this watermark.
Replication Architecture Basics
Chapter 3 discussed the binary log along with some of the tools that are available to
investigate the events it records. But we didn’t describe how events make it over to the
slave and get reexecuted there. Once you understand these details, you can exert more
control over replication, prevent it from causing corruption after a crash, and investi-