Replication in AMGA

This section presents the Replication mechanisms that are part of AMGA since version 1.2.

Overview

In this section we provide the technical background necessary to understand replication in AMGA. Reading this section is highly recomended for anyone interested in using this feature. Further details can be found in the following article, available on the AMGA web page:

N. Santos and B. Koblitz, Distributed Metadata with the AMGA Metadata Catalog, In Workshop on Next-Generation Distributed Data Management - HPDC-15, Paris, France, June 2006

Some more background is also explained here.

Features Overview

Replication in AMGA follows an asynchronous, master-slave model, and supports partial replication of the directory hierarchy.

In master-slave replication only one of the replicas, the master, is writable, with all other replicas being always read-only. This model is sufficient for all applications with read-only metadata or where the metadata is written only at a single geographical location, which covers most of the Grid applications.

Slaves can replicate any sub-tree of the metadata hierarchy. This allows slaves to copy only the data they are interested on, reducing the load both on the slave and on the master, as well as the bandwidth requirements. If desired, a full replica of the master can be obtained by replicating the root directory.

Concepts

Next is the description of the main concepts used in AMGA Replication.

Master: Any node that exports part of its metadata hierarchy for replication.
Slave: Any node that replicates metadata from a master. A node can be both a slave and a master at the same time, or neither if working standalone.
Replication log: Shipped from a master to a slave, containing a metadata command plus some context information that allows the slave to replay the original command executed at the master.
Subscriptions: To replicate a directory, the slave creates a subscription for that directory with the master that will allow it to receive the replication logs for that particular directory.
Mount: On the slave side, the root of the sub-tree subscribed from the master is called the mount. There is a mount for each subscription and it corresponds the the path of the directory that is replicated. Each mount contains also all the sub-directories of the mounted directory.
Initial Synchronization: When a slave subscribes for the first time to a directory, it must copy the existing data for which there are no replication logs.

Architecture Overview

In AMGA replication, the master ships to the slaves logs containing the metadata commands executed at the master, much in the same way as it is done in Oracle Streams and MySQL replication. When updated by a client, the master saves a replication log containing the command and some context information to the logs table on its database backend. A separate program, the replication daemon, queries the database periodically looking for new logs and shipping them to the slaves with subscriptions to the directories updated by the command of the log. This replication daemon is also responsible for managing the list of slaves that are replicating from the master. At the slave, the logs are replayed to update the slave's metadata. The logs contain only metadata commands, and they are totally independent of the underlying database backend.

Logs are assigned a unique sequence number at creation time, the log xid. Each AMGA instance generates its own identifiers in an independent way, so the xid is unique only inside an AMGA instance. The xid is necessary for synchronization between the slave and the master. For each directory replicated, the slave keeps the xid of the last log it received and applied, so that it knows from what point to resume after connection failures or shutdowns. The master also keeps track of the subscriptions by storing them persistently on its database backend, on the subscribers table. For each subscription, it stores the most recent xid that the slave has acknowledge. This is necessary to know when there are no more slaves waiting for a particular log, so it can be removed from the logs table.

Replication Daemon

The bulk of the operations necessary for replication on master nodes are implemented outside the amgad process, by the mdrepdaemon program, also called the replication daemon. The amgad is only responsible for writing the replication logs to the database, while the replication daemon does everything else related to replication, including:

Managing subscriptions
Accepting replication requests from slaves
Sending the initial snapshot of replicated directories to slaves
Polling the logs table and shipping logs to the interested subscribers
Cleaning up the logs table, removing any unnecessary logs.

The replication daemon is independent of the AMGA server it is working for, and needs only to connect to the database backend used by the AMGA server. Apart from that, there is no communication between them and they can run separately, even on different machines for better load balancing.

Operation

Nodes interested in replicating from another node must subscribe to the directories they wish to receive by contacting the replication daemon of that node. After connecting, the slave informs the master of the directories it is interested on, and begins copying the contents of the database. This is done using the dump feature of AMGA, which generates the commands that must be executed on another AMGA instance to recreate a directory hierarchy. The replication daemon internally executes a dump and forwards the commands to the slave, that replays them. Each directory is shipped using a database transaction to isolate it from updates that may be happening concurrently. It is also tagged with the xid of the last log generated for that directory before the synchronization started, so the slave will know from what point to start receiving logs after finishing the initial synchronization. Updates generated during the synchronization will be saved as logs and shipped to the slave after the synchronization is done.

The initial synchronization might be a lenghty process and currently there are no provisions for resuming from failed synchronizations. Nevertheless, if the synchronization is interrupted, the slave will reestablish a consistent state by discarding all the information received. In the future, we plan to implement mechanisms to allow resuming partial synchronizations.

After having the initial snapshot, the slave can start receiving and applying logs. In the current implementation the slave connects to the master using a TCP connection, sends the xid of the earliest log that it wants to receive and waits for incoming logs. When the slave receives a log, it executes the log locally, and after making sure that the log is safely committed to the database, it sends an acknowledge to the master. After receiving the acknowledge, the master is free to delete the log. To ensure good performance over high-latency connections, this communication between the master and the slave is asynchronous, that is, the master sends the logs without waiting for acknowledges of previous logs, and polls the socket periodically with a non-blocking operation looking for incoming acknowledgments.

The replication daemon stores the information about the subscriptions in the local database in order for it to survive eventual crashes of the master node. This information includes the slave's id, address, directories subscribed, and the id of the last log acknowledged.

Logs must be deleted when they are no longer needed. This is done also in the replication daemon, by the same thread that monitors the logs. After polling for new logs, shipping them to subscribers, and polling for new acknowledges from clients, this thread goes over the logs table and deletes the logs that are no longer needed by any subscriber. Under normal conditions, that is, when all subscribers are connected, logs are deleted shortly after being generated. A log is kept for a longer time only when a subscriber is disconnected.

To tolerate failures resulting in disconnections of the subscribers, subscriptions are persistent, in the sense that if a subscriber disconnects without having first requested to be unsubscribed, the master will preserve the subscription and continue saving logs for the directories subscribed by the slave. When the subscriber reconnects, the subscription is resumed from the point it was interrupted. Currently, there is no provision for dealing with slaves that are disconnected for too long. In this case, the logs will accumulate on the logs table, eventually causing problems on the database backend. In the future, we plan to implement mechanisms for controlling the growth of the log table by removing old subscriptions when certain conditions are met, like the log table exceeding a maximum size or a subscriber being disconnected for too long.

Setup

Setting up a Master node

A master node consists of an amga daemon configured to save replication logs on the database plus an associated replication daemon. To simplify the setup, both processes use exactly the same format for the configuration file, allowing them to share the same configuration file.

The amgad is configured to save logs by setting the following property on the configuration file:

  EnableMaster = 1

As an optimization, logs are saved only if there is at least one slave interested on the directory updated by the log, so if there are no subscriptions, the cost of having replication enabled is only an extra database query (to look for active subscriptions).

The replication daemon must be configured to point to the same database as the amga daemon. The simplest way of doing this is to reuse the same configuration file for both programs, although it is possible to have separate files. Additionally, the replication daemon needs to be assigned to a port where it will listen for incoming connections. This is configured by the Port property on the Replication section of the configuration file:

  [Replication]
  ...
  ReplicationDaemonPort=8823

This is all that needs to be configured for the replication daemon. After this, it can be started with:

$ mdrepdaemon [-c <amgad.config>]

The accepted options are:

 -p <port>        : listen port
 -d               : Activate debug output. Very verbose.
 -c <configFile>  : configuration file

If no options are given, it will look for amgad.config in the same way as amgad does.

After starting up, the replication daemon waits for connections from slaves and polls the database periodically looking for new logs. It will also go over the the logs table periodically to delete logs that are no longer needed.

Setting up a Slave node

The slave functionality is implemented fully on the mdserver, and therefore, there is no need to run an external program like the replication daemon. To activate replication as a slave, the following setting must be enabled on the configuration of the mdserver:

  EnableSlave = 1

Activating the slave mode enables the rep_* commands for the slave, and has no impact on the performance.

A slave node needs also to be configured with the settings of master nodes that it might be required to contact. This configuration is defined on a per-node basis, which is done through the site_* commands, which can be used to define sites and set their parameters like the login name the slave will use or the port of the replication daemon to contact on this site. The minimum configuration necessary is (pointing to a replication daemon listening on port 8832 and logging into it as root):

Query> site_add SiteA localhost:8832
Query> site_set_properties SiteA login root

Please consult Section Replication for further information on how to define the master nodes and Site management for how to configure sites. Once configured as a slave node, the server is started normally. There are no special command line options related to replication.

On startup, the server will consult its database backend to check if there are any subscriptions on the active state. If so, it will try to restart receiving logs from these masters, by reopening a TCP connection to each master and asking for the logs after the last one it received. If the connection can't be established immediately, it will continue to try, waiting 60 seconds between attempts, until either it succeeds or the subscriptions is stopped using the rep_stop_receive command.

Security

Replicating Security Information

There are two types of security information managed by AMGA:

permissions and acls - This is associated with the metadata.
groups and users - Global to the catalogue instance, and not associated directly with the metadata.

Both types of information can be replicated in AMGA. The first is replicated as part of the metadata replication. The rep_mount command by default imports this information from the master. In this case, the permissions and owner of the metadata on the slave will be exactly the same as in the master and will be kept synchronized with the master. This behavior can be disabled by specifying the -noperms option, so that the slave replicates the metadata without any security information. The resulting replica will be created using the security settings in effect at the slave at the time of replication. After a directory being mounted in a slave, it is not possible to change its configuration concerning replication of permissions and acls.

Groups and users are replicated on their own, using the commands rep_mount_users and rep_umount_users. They are handled like a virtual mount, in the sense that they have to be synchronized initially by executing rep_mount_users and will be updated after establishing a connection to the master with rep_start_receive. The hash of the password of the user is also replicated. This allows the slave to authenticate the replicated users locally. A slave can only replicate users and groups from a single master and it should have no local users or groups of its own. The exception is the root user and any group owned by root, which are always considered local users and are not replicated.

Security at Slave Nodes

Only the root user is allowed to initiate or to control replication, that is, to execute the rep_* commands. For any other users, the replicated directories will be readable depending on their permissions and access control lists, which may or may not have been replicated from the master depending on the options given to rep_mount. A replicated directory is always read-only, regardless of the write permission or access control right.

Connections from the Slave to the Master

Connections from slaves to the replication daemon use a similar protocol as the ones from the clients to the mdserver, including the support for authentication and encryption. Therefore, the connection settings (e.g., use SSL, authenticate with password, certificates or grid proxies,...) used by the mdrepdaemon can be configured in the same way as the mdserver. In fact, it's even possible to use the same configuration file, in which case the connection settings will be the same. To have different settings, it's enough to specify different configuration files using the command line options of each program.

When connecting to the replication daemon, slaves may need to authenticate. Once again, this is done just like if the slave were a client connecting directly to the master. The replication daemon will accept the same credentials and users as the mdserver of that node. The slave node is configured in the configuration file of the slave's mdserver, on the [Replication] section. The slave can specify several different master nodes, with different connection settings for each. It is possible to check the list of known nodes at a slave during an interactive client session using the rep_list_nodes command.

Granting the replication right on the master

The master controls what groups can replicate which directories by granting the replication right. This right is granted to groups, allowing them to replicate the specified directory, including the full sub-tree rooted on it and all their entries. The right is granted using an interactive session to the master with the commands rep_allow and rep_disallow.

An important point about this right is that for replication it is the only access control performed by the master. This is easier to illustrate with an example. Suppose that user joe does not have permission to read or write directory /jobs when connected as a client, but has the replicate permission over that directory. Then, if a slave connects to this master authenticating as joe it will be able to replicate the full contents of /jobs. Once this information is on the slave, it is completely exposed to the slave's administrator. If the administrator is trustworthy, he will allow the system to enforce the original access permissions. But malicious administrators can easily expose the information to any user.

Limitations

Concurrent Updates on Master

When two or more clients are updating the master concurrently, there is a small probability that the slaves become inconsistent with the master. For this to happen, the clients must be writing (updating, deleting or inserting) to the same collection and to intersecting sets of rows. This is a concurrency problem, so it will happen randomly depending on the interleaving of the execution of two or more concurrent commands. This problem can be avoided in two ways:

Prevent concurrent updates to the same directory.
Set the transaction isolation mode of the database to serializable.

The second solution is generic but has a high overhead on the database backend. Another disadvantage is that the database backend will abort the commands that cannot be serialized due to conflicts with other concurrent clients, and this will create a new failure mode that clients might not be expecting.

Technical Details

The replication mechanisms of AMGA are based on the assumption that it is possible to order the write commands received by the master into a sequence that if replayed in the slave will produce the exact same results. But this is not necessarily true, depending on the transaction isolation level used by the database back-end. Most database systems use an isolation mode (Read Committed) that allows transactions to interleave in a way that makes then non-serializable. This is, there is no ordering of transactions that if replayed sequentially on a different database would produce the same results. This basically means that with the default isolation levels used by databases, there is no way of guaranteeing consistency between the master and the slave in all possible situations. Databases support stricter isolation levels, including a Serializable level. This mode will provide the guarantees needed by AMGA, but has a high overhead and therefore should be used only when the server needs to support concurrent updates.

Tutorial - Setting up a Master to Single Slave Setup

This section provides a step by step tutorial of how to prepare a simple replication setup. It will show how to setup replication from a master to a single slave. These are the names of the hosts that will be used:

master - gridpc1.cern.ch, port 8822
slave - gridpc2.cern.ch, port 8822

The tutorial assumes that the master and the slave nodes have already valid AMGA instances configured and operational, but with replication disabled. On the master, we assume the following directory hierarchy:

    \users
    \files
    \files\2005
    \files\2006

We assume that the slave is interested in replicating the \files\2006 directory.

Configuring the master

Configure the AMGA instance at gridpc1 to act as a master and configure the port where the replication daemon will listen. This is done on the mdserver.config file, with the following properties:
```
    [Replication]
    EnableMaster = 1
    ReplicationDaemonPort=8823
```

Setup authentication and security for connections from slaves. For this tutorial we will require no security and accept plain text connections. Edit amgad.config at the server adding the following lines to the main section:
```
    UseSSL = 0
    RequireAuthentication = 0
```

Configure access permissions to replication. This is done by connecting to the mdserver instance of the master over a normal interactive session using mdclient. First we create a group for users allowed to replicate. We will call it replicators, but any group name will do:
```
    $ mdclient gridpc1
    Connecting to localhost:8822...
    ARDA Metadata Server 1.2.0
    Query> grp_create replicators
```

Now we must add to this group the users allowed to replicate. For this tutorial, we will create a new user called joe but any existing user can be used:
```
    Query> user_create joe
    Query> grp_adduser replicators joe
```

Add the replicators group to the list of groups allowed to replicate the /files directory.
```
    Query> rep_allow /files replicators
```
This will allow the replicators group to replicate all the contents and subdirectories of the /files directory.

Back at the shell, start the replication daemon:
```
    $ mdrepdaemon -c amgad.config
```

Configuring the slave

Now we are going to configure the AMGA instance at gridpc2 to act as a slave. We start by editing the mdserver.config file.

Activate the slave functionality:

    [Replication]
    ...
    EnableSlave = 1
    ...

Configure the list of master nodes. In this case, it's only a single master, you need to be root to set up the sites:

    Query> site_add GridPC1 gridpc1.cern.ch:8823
    Query> site_set_properties GridPC1 login joe
    Query> site_set_properties GridPC1 use_ssl 0
    Query> site_set_properties GridPC1 authenticate_with_certificate 0

Note that only the login property is mandatory, the other settings are the default, anyway.

We can now start the AMGA server on the slave and connect to it using the mdclient. The rest of the tutorial is done from inside the mdclient shell.

Create the parent directory of the directory we are going to mount.
```
    Query> createdir /files
```
The slave expects this directory to exist, otherwise it will fail. We should also make sure that there is no directory with the same name as the one we are going to mount, in this case /files/2006.

Mount the directory.
```
    Query> rep_mount GridPC1 /files/2006
```
The slave connects to the master, subscribes to this directory and copies its contents. When the command finishes executing, we should have a local copy of the remote directory. We can check the status of the mount:
```
    Query> rep_list_mounts
    >> GridPC1:/files/2006 - 23628, Disconnected
```
The Disconnected state means that we are currently not receiving logs. The number before the state is the xid corresponding to the position of our local snapshot on the log sequence.

Start receiving logs. The slave now must connect to the master and wait for logs. This is done with the command:
```
    Query> rep_start_receive GridPC1
    Query> rep_list_mounts
    >> gridpc11:/files/2006 - 23645, Receiving
```
The slave will keep the connection to the Master open until it is shutdown or until the we stop the subscription. If the slave is restarted, it will resume automatically the subscriptions that were active when it was shutdown. It will also try to reconnect if the connection fails for some external reason.
Removing the subscriptions. If we want to cancel the subscription we have first to stop receiving the logs and them unmount the remote directory:
```
    Query> rep_stop_receive GridPC1
    Query> rep_umount /files/2006
    Query> rep_list_mounts
    Query>
```

Generated on Mon Apr 16 13:59:18 2012 for AMGA by

1.4.7