Progress Fathom Replication
User’s Guide


Lost connection

A lost connection, where a Fathom Replication server loses contact with its agents, can occur for a variety of reasons.

When a lost connection occurs, the source goes into failure recovery and the target goes into transition.

Detecting TCP/IP communications failures

A break in the TCP/IP connection between the Fathom Replication server and its agents can go undetected. For example, in a large, complex network with a number of bridges and routers, a segment of the network could go down, interrupting communications between the server host machine and the agent host machine. However, TCP/IP would still be running in other segments of the network, so the server or agent might be unaware of the break.

You can ensure that TCP/IP failures are detected by having the server and agent ping each other. If there is no response to the ping, the connection is assumed to be broken and failure recovery begins.

Use the server Repl-Keep-Alive property to enable pinging between the server and the agent. When pinging is enabled, a ping is sent every thirty seconds, and the Repl-Keep-Alive value specifies the number of seconds to wait for a response. If no response arrives within the specified period (the default is 300 seconds), a connection failure condition is set and failure recovery begins.
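The timeout rule described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not the product's implementation: the class KeepAliveMonitor and its method names are hypothetical, and the only value taken from the text is the 300-second default.

```python
DEFAULT_KEEP_ALIVE_SECONDS = 300  # from the text: default Repl-Keep-Alive wait


class KeepAliveMonitor:
    """Hypothetical model of keep-alive failure detection (not product code)."""

    def __init__(self, keep_alive_seconds=DEFAULT_KEEP_ALIVE_SECONDS, start_time=0.0):
        self.keep_alive_seconds = keep_alive_seconds
        self.last_response_time = start_time

    def record_response(self, now):
        """Call whenever the peer answers a ping."""
        self.last_response_time = now

    def connection_failed(self, now):
        """True once no response has arrived for the configured period."""
        return now - self.last_response_time >= self.keep_alive_seconds


monitor = KeepAliveMonitor()
monitor.record_response(now=60)
print(monitor.connection_failed(now=330))  # 270 seconds of silence
print(monitor.connection_failed(now=360))  # 300 seconds of silence
```

In this sketch, times are plain numbers of seconds so that the rule is easy to follow; a real monitor would read a clock and send the pings itself.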

For more information about configuring Repl-Keep-Alive, see the "Server properties" section.

Source failure recovery after losing connection

When the Fathom Replication server loses its connection with one or more Fathom Replication agents, it tries to re-establish the connection for the period set by the connect-timeout value in the Fathom Replication server properties file.

The Fathom Replication server does the following:

  1. The Fathom Replication server recognizes that there has been an agent failure. It places itself into a state that allows continuous RDBMS activity, as if the Fathom Replication server were not running.
  2. The Fathom Replication server tries to reconnect to Fathom Replication agents for a set amount of time.
  3. Source database activity by clients is still allowed unless synchronous replication is being used or schema updates are being performed by a process.
  4. If the Fathom Replication server is able to reconnect to the Fathom Replication agent, it again begins processing AI blocks from the RDBMS. When it gets within ten AI blocks of the RDBMS, the Fathom Replication server halts normal database activity and completes the synchronization process.
  5. Schema updates are not allowed while the Fathom Replication server is performing synchronization. If schema updates are being performed when failure recovery synchronization begins, source database updates will block until failure recovery completes.

    Source database activity cannot continue without the agent connected when synchronous replication is being used.

  6. When synchronization is completed, the Fathom Replication server reinserts itself into the AI block write process and the RDBMS is unlocked, allowing normal database activity and replication activity to continue.

If the Fathom Replication server cannot reconnect within the configured connect-timeout period, either to all agents or to a critical agent, it terminates and source database activity continues. In other words, if there are no critical agents, the server must reconnect to all agents or it terminates. If one agent is a critical agent, the server continues as long as it can reconnect to that agent. While source database activity continues without the Fathom Replication server running, make sure that there is enough AI extent space to handle all database activity until the Fathom Replication server is restarted and replication resumes.
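The reconnect rule above can be expressed as a short decision function. The following Python sketch is an assumption-laden illustration, not product code: AgentStatus, its fields, and server_continues are hypothetical names, and the agent names in the usage lines are made up.

```python
from dataclasses import dataclass


@dataclass
class AgentStatus:
    name: str
    critical: bool     # hypothetical flag marking the critical agent
    reconnected: bool  # reached again before connect-timeout expired?


def server_continues(agents):
    """Return True if the server keeps running after the connect-timeout period."""
    critical = [a for a in agents if a.critical]
    if critical:
        # With a critical agent configured, reconnecting to it is sufficient.
        return all(a.reconnected for a in critical)
    # With no critical agent, every agent must have reconnected.
    return all(a.reconnected for a in agents)


# One critical agent reconnected, one non-critical agent still down.
print(server_continues([AgentStatus("agent1", True, True),
                        AgentStatus("agent2", False, False)]))
# No critical agent, and one agent still down.
print(server_continues([AgentStatus("agent1", False, True),
                        AgentStatus("agent2", False, False)]))
```

The first call models a server that continues because its critical agent is back; the second models a server that terminates because, with no critical agent, all agents must reconnect.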

During failure recovery synchronization, the Fathom Replication server might not catch up to the RDBMS. Until it does, the target databases are not up to date with the source.
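Whether the server catches up depends on how quickly it processes AI blocks compared with how quickly the source writes them. The Python sketch below simulates that race as a rough illustration only: ticks_until_sync, its parameters, and the notion of a "tick" are hypothetical, and the only value taken from the text is the ten-block threshold.

```python
SYNC_THRESHOLD_BLOCKS = 10  # from the text: source activity halts within ten AI blocks


def ticks_until_sync(backlog, server_rate, source_rate, max_ticks=1000):
    """Return the tick at which the backlog is within the threshold, or None.

    backlog: AI blocks the server is behind when it reconnects.
    server_rate: blocks the server processes per tick (hypothetical unit).
    source_rate: blocks the source writes per tick (hypothetical unit).
    """
    for tick in range(max_ticks + 1):
        if backlog <= SYNC_THRESHOLD_BLOCKS:
            return tick
        backlog += source_rate - server_rate
    return None  # the server never got close enough within max_ticks


print(ticks_until_sync(backlog=100, server_rate=20, source_rate=5))
print(ticks_until_sync(backlog=100, server_rate=5, source_rate=5))
```

In the second call the server processes blocks no faster than the source writes them, so the backlog never shrinks and the function returns None, which corresponds to the case where the server does not catch up.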

Target transition after losing connection

When the Fathom Replication agent loses contact with the Fathom Replication server, the Fathom Replication agent goes into transition. During transition after a lost connection, the agent listens for the Fathom Replication server in order to re-establish the connection. If auto transition is configured, the agent listens for the period set by the transition-timeout value in the Fathom Replication agent properties file. The Fathom Replication agent does the following:

  1. When the Fathom Replication agent first loses contact with the Fathom Replication server, it goes into a pre-transition state where it listens for the Fathom Replication server.
  2. If contact is not established and the agent is configured to perform auto transition, the target database is transitioned to a normal Progress database. A normal Progress database means that all standard client connections and updates can be performed on it.
  3. If manual transition is configured, the Fathom Replication agent continues waiting until the database administrator initiates a change. Until the administrator initiates a change using the DSRUTIL Utility, the database will remain in an unknown state.
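The steps above can be summarized as a small state function. This Python sketch is a simplified illustration, not the agent's actual logic: the AgentState names and the agent_state function are hypothetical, and treating a manually configured agent as simply waiting for the administrator is an assumption drawn from step 3.

```python
from enum import Enum


class AgentState(Enum):
    CONNECTED = "connection with the server re-established"
    PRE_TRANSITION = "listening for the server"
    TRANSITIONED = "target is a normal database"
    WAITING_FOR_ADMIN = "waiting for the administrator to initiate a change"


def agent_state(server_contacted, seconds_waited, transition_timeout, auto_transition):
    """Return the modeled agent state after a lost connection."""
    if server_contacted:
        return AgentState.CONNECTED
    if not auto_transition:
        # Manual transition: the agent waits until the administrator acts.
        return AgentState.WAITING_FOR_ADMIN
    if seconds_waited < transition_timeout:
        # Auto transition: keep listening until transition-timeout expires.
        return AgentState.PRE_TRANSITION
    return AgentState.TRANSITIONED


print(agent_state(False, 120, 600, auto_transition=True).name)
print(agent_state(False, 600, 600, auto_transition=True).name)
print(agent_state(False, 600, 600, auto_transition=False).name)
```

The three calls model an agent still listening, an agent whose timeout has expired under auto transition, and an agent configured for manual transition.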

For more information about transition, see the "Target transition" section.


Copyright © 2004 Progress Software Corporation
www.progress.com
Voice: (781) 280-4000
Fax: (781) 280-4095