
Failover

Systems with two or more OpenData EE Servers and a remote database can provide seamless failover if the active server fails. Each server runs three non-database services: the Message Broker, the Event Manager, and the web application. The following illustration shows the key application components in a simplified configuration.

The following sections describe the stages of interruption and recovery that take place during a failover. In practice, these changes happen quickly enough that service continues without interruption throughout the failure.

Failover Initial State

In the initial state, the Device Gateway is reporting its status through its connection to the primary server, and the Event Manager Watchdog Timer is running for each individual virtual device. Every time a message of any kind is received from a device, its timer is reset. After 15 minutes with no messages, the timer runs out and the device is marked as offline.
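The watchdog behavior described above can be sketched as follows. This is a minimal illustration, not the Event Manager's actual code: the class name DeviceWatchdog, its methods, and the injectable clock are all hypothetical, and treating a never-seen device as expired is an assumption.

```python
import time

WATCHDOG_TIMEOUT_S = 15 * 60  # 15 minutes, as described in the text


class DeviceWatchdog:
    """Hypothetical per-device watchdog: records when each device last sent a message."""

    def __init__(self, timeout_s=WATCHDOG_TIMEOUT_S, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the timer can be tested without waiting
        self.last_seen = {}     # device_id -> time of the most recent message

    def on_message(self, device_id):
        # A message of any kind resets the timer for that device.
        self.last_seen[device_id] = self.clock()

    def expired(self, device_id):
        # Assumption: a device that has never sent a message counts as expired.
        last = self.last_seen.get(device_id)
        if last is None:
            return True
        return self.clock() - last >= self.timeout_s
```

Keeping one timestamp per device, rather than one timer per server, matches the statement that the timer runs for each individual virtual device.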

Failover Stage Two

In stage two, the connection between the Device Gateway and the primary server has been broken. In the following illustration the failure is in the network connection and the primary server is still running, so its watchdog timers count down without being reset. The Device Gateway immediately attempts to connect to the secondary server, and the Event Manager on the secondary server responds by collecting the latest device status information from the database. Here you can see the importance of having the database situated remotely from the primary server: if the server itself went down, any data stored on it would be unavailable to the secondary Event Manager.
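As a rough sketch of why the shared database matters, the secondary Event Manager can be modeled as loading its view of the devices from a store that both servers can reach. The EventManager class, the on_gateway_connected method, and the dictionary standing in for the remote database are hypothetical names chosen for illustration only.

```python
class EventManager:
    """Hypothetical model of an Event Manager that reads from a shared, remote database."""

    def __init__(self, name, database):
        self.name = name
        self.database = database   # stand-in for the remote database: device_id -> status record
        self.device_status = {}    # this server's in-memory view, empty until a gateway connects

    def on_gateway_connected(self):
        # Collect the latest status of every device from the shared database,
        # copying each record so the local view is independent of the store.
        self.device_status = {
            device_id: dict(record) for device_id, record in self.database.items()
        }
        return self.device_status


# The status was written while the primary was active; the secondary can still read it.
shared_db = {"dev-1": {"state": "online", "last_message": 100.0}}
secondary = EventManager("secondary", shared_db)
secondary.on_gateway_connected()
```

If the records lived only on the primary server, the dictionary passed to the secondary would be empty after a primary outage, which is the situation the remote database avoids.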

Failover Stage Three

In stage three, the primary server has been disconnected from the Device Gateway for more than 15 minutes and its watchdog timers have timed out. Before marking the devices as offline, the primary server sends a message to the Device Gateway and waits for a response. If there is no response, it checks the database to see whether current data is being entered. The secondary server has taken over all the tasks usually performed by the primary server, so the primary server sees the current data and resets its timers.
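The decision the primary server makes when a timer expires can be outlined as below. This is a sketch under stated assumptions: the function name, its parameters, and the returned strings are hypothetical; the text does not say what happens when the Device Gateway does respond, so resetting the timer in that case is an assumption, as is reusing the 15-minute window to decide whether database data is "current".

```python
def on_watchdog_expired(device_id, ping_gateway, last_db_update, now, timeout_s=15 * 60):
    """Decide what to do when a device's watchdog timer runs out.

    ping_gateway:   callable returning True if the Device Gateway answered (hypothetical)
    last_db_update: callable returning the time of the device's newest database
                    entry, or None if there is none (hypothetical)
    """
    # Step 1: message the Device Gateway and wait for a response.
    if ping_gateway():
        return "reset_timer"  # assumption: a response means the device is still reachable

    # Step 2: no response, so check whether current data is reaching the database,
    # for example because another server has taken over.
    last = last_db_update(device_id)
    if last is not None and now - last < timeout_s:
        return "reset_timer"

    # Neither check succeeded, so the device is marked as offline.
    return "mark_offline"
```

In the stage three scenario, the gateway check fails but the database check succeeds, so the function returns "reset_timer" and the devices are not marked offline.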

NOTE: The Device Gateway does not automatically return to the primary server when the connection is restored. The secondary server continues to operate until a failure forces the Device Gateway back to the primary server. You can trigger this manually by temporarily shutting down the secondary server.
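The behavior in this note, where the Device Gateway stays on whichever server it is using until that server fails, can be sketched as follows. The DeviceGateway class, the check_connection method, and the is_reachable callback are hypothetical and only model the server selection, not real networking.

```python
class DeviceGateway:
    """Hypothetical model of the gateway's server selection (no automatic fail-back)."""

    def __init__(self, servers, is_reachable):
        self.servers = list(servers)      # for example ["primary", "secondary"]
        self.is_reachable = is_reachable  # callable: server name -> bool
        self.active = self.servers[0]

    def check_connection(self):
        # Stay on the current server for as long as it is reachable,
        # even if a previously failed server has come back.
        if self.is_reachable(self.active):
            return self.active
        # Otherwise switch to the first other server that is reachable.
        for server in self.servers:
            if server != self.active and self.is_reachable(server):
                self.active = server
                break
        return self.active


up = {"primary": True, "secondary": True}
gateway = DeviceGateway(["primary", "secondary"], lambda name: up[name])
up["primary"] = False
gateway.check_connection()   # connection lost: the gateway moves to the secondary
up["primary"] = True
gateway.check_connection()   # primary restored: the gateway stays on the secondary
up["secondary"] = False
gateway.check_connection()   # secondary shut down: the gateway returns to the primary
```

The last two calls correspond to the manual procedure described in the note: the gateway returns to the primary only once the secondary is shut down.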