5.27.1 BE Failover

 



 

Overview

 

Web NMS can run two BE servers: a primary and a standby.

The Primary and Standby BE servers are redundant configurations designed to serve the same functionality. They both have access to the same database. When the primary server fails or is brought down, the standby server takes over the functions that were being performed by the primary. The primary server may be brought down for scheduled maintenance.

 

The standby server of Web NMS offers warm standby support. Any operation or request in the Network during the intervening period (i.e., the time period between the failure of primary and the subsequent complete take over by standby) will be lost.

 

Working Mechanism

 

What does Primary Server do ?

 

In a failover setup, the primary and standby servers should have access to the same database. In the database, details regarding the primary and secondary servers are maintained in a table named BEFailOver. At a specified regular time interval, the primary server records its presence in the BEFailOver table with a symbolic count. With every update the count gets incremented. This count is known as LASTCOUNT. The periodic interval at which the primary has to update the database regarding its presence is known as HEART_BEAT_INTERVAL. If you specify 30 seconds as the value for HEART_BEAT_INTERVAL, the primary will update the BEFailOver table every 30 seconds. This interval is configurable.
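As a rough illustration of the heartbeat described above, the following sketch keeps an in-memory stand-in for one server's LASTCOUNT and increments it on every beat. The class and method names (HeartbeatSketch, beat, lastCount, start) are hypothetical and are not part of the Web NMS API; a real server would update its row in the BEFailOver table instead.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for one server's LASTCOUNT column in BEFailOver.
class HeartbeatSketch {
    private final AtomicLong lastCount = new AtomicLong();

    // One heartbeat: increment LASTCOUNT and return the new value.
    long beat() {
        return lastCount.incrementAndGet();
    }

    long lastCount() {
        return lastCount.get();
    }

    // Schedules beat() once per HEART_BEAT_INTERVAL (given in seconds).
    ScheduledExecutorService start(long heartBeatIntervalSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::beat, 0, heartBeatIntervalSeconds, TimeUnit.SECONDS);
        return timer;
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatSketch primary = new HeartbeatSketch();
        // A 1-second interval is used only to keep the demonstration short.
        ScheduledExecutorService timer = primary.start(1);
        Thread.sleep(2500);
        timer.shutdownNow();
        System.out.println("LASTCOUNT = " + primary.lastCount());
    }
}
```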

 

What does Standby Server do ?

 

When the primary server is running, if you start the other server (standby), it tries to register with the primary server. If no standby server is already registered, the primary server registers this as the standby server. At any time, only one primary server and one standby server can be configured. If you try to start a second standby server, the primary server will refuse registration. When the primary server registers a standby server, it makes an entry regarding the registration in the database.

 

Like the primary, the standby server records its presence in the BEFailOver table at a specified periodic interval (HEART_BEAT_INTERVAL) through its own LASTCOUNT, which gets incremented with every update. The primary server monitors the LASTCOUNT of the standby server at the specified time interval. When the standby fails to update the LASTCOUNT, the primary assumes that the standby has failed, cancels its registration, and removes its entries from the BEFailOver table. This makes it possible to connect a new standby server.

 

What happens when the Primary Server fails ?

 

When the primary server fails, it stops updating the LASTCOUNT. The standby server monitors the primary's LASTCOUNT at a specified periodic interval known as FAIL_OVER_INTERVAL. For example, if you have specified FAIL_OVER_INTERVAL as 50 seconds, the standby checks the primary's LASTCOUNT every 50 seconds. Every time the standby server looks up the LASTCOUNT, it compares the previous and present counts. When the primary server fails to update the LASTCOUNT, consecutive counts are the same, which indicates to the standby that the primary may have failed. The RETRY_COUNT parameter specifies the number of times the standby has to check the primary's LASTCOUNT (when the primary fails to update the LASTCOUNT) before assuming that the primary has failed.
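The comparison of consecutive counts can be sketched as follows. This is a minimal illustration, not the Web NMS implementation: the class PrimaryMonitorSketch is hypothetical, and it assumes that the primary is declared failed once the LASTCOUNT has been found unchanged on RETRY_COUNT consecutive checks.

```java
// Hypothetical sketch of the standby's periodic check of the primary.
class PrimaryMonitorSketch {
    private final int retryCount;     // RETRY_COUNT
    private long previousCount = -1;  // LASTCOUNT seen at the previous check
    private int unchangedChecks = 0;  // consecutive checks without progress

    PrimaryMonitorSketch(int retryCount) {
        this.retryCount = retryCount;
    }

    // Called once per FAIL_OVER_INTERVAL with the primary's current LASTCOUNT.
    // Returns true when the primary should be treated as failed.
    boolean check(long currentCount) {
        if (currentCount != previousCount) {
            previousCount = currentCount; // the primary made progress
            unchangedChecks = 0;
            return false;
        }
        unchangedChecks++;
        return unchangedChecks >= retryCount;
    }

    public static void main(String[] args) {
        PrimaryMonitorSketch monitor = new PrimaryMonitorSketch(2);
        long[] observed = {10, 11, 12, 12, 12}; // the primary stops updating after 12
        for (long count : observed) {
            System.out.println("LASTCOUNT=" + count + " failed=" + monitor.check(count));
        }
    }
}
```

With RETRY_COUNT set to 2, the sketch reports a failure only on the second consecutive unchanged check, that is, on the last value in the sample sequence.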

 

Once the standby finds that the primary has failed, it immediately changes its mode to PRIMARY and assumes all the functions that were being performed by the previously active server. The Front End Server then switches its connection over to the standby server. Clients connected to the FEs continue to receive service without any knowledge of the failover. Hence, the clients connected to the Web NMS front end servers are not affected by this failover process, except during the brief failover interval itself, when they may get some timeouts.

 

The following Time Line diagram depicts the flow of the complete happenings in the failover process.

 

 

Note: To set up BE Failover successfully, both the primary and the secondary server machines must be DNS enabled.

 

 

How to set up BE Failover ?

 

Assume that you have BE servers installed on two machines, Machine 1 and Machine 2, and that you want to start the BE server in Machine 1 as primary and the BE server in Machine 2 as standby. You can do so with the following steps:

  1. Start the BE server in Machine 1.

  2. Copy the hibernate.cfg.xml file of the BE server in Machine 1 into the <Web NMS Home>/classes/hbnlib directory of the BE server in Machine 2, replacing the existing file.

  3. In the copied file, change the DB host name entry from localhost to the host name of the machine where the DB server is running (needed when the BE Failover is set up in a different machine).

  4. Start the BE server in Machine 2.

The above steps set up the BE failover environment. When you start the primary server, an entry is made in the BEFailOver table with the following information about the primary server: HOSTADDRESS, NMSBEPORT, RMIREGISTRYPORT, LASTCOUNT, SERVERROLE, and STANDBYSERVERNAME. When another BE server is started with the same database parameters (i.e., with the same hibernate.cfg.xml file) as the primary server, it checks whether the BEFailOver table has any entry with SERVERROLE as primary. If such an entry is found, it registers itself with the primary server as a standby server.

 

Now, the primary server will run in Machine 1 and the standby server will run in Machine 2. When the primary server fails, the standby server will take over the tasks of the primary server automatically.

 

 

Note: Until Web NMS 4.7.0 SP1, users must explicitly specify the mode (PRIMARY or STANDBY) in which the BE server should be started in the <Web NMS Home>/conf/FailOver.xml.

 

From Web NMS 4.7.0 SP2 onwards, when a BE server is started, the Web NMS framework reads the database directly (instead of reading the <Web NMS Home>/conf/FailOver.xml) to find out whether a PRIMARY server is already running. If a PRIMARY server is already running, the current BE server is started in STANDBY mode; otherwise, it is started in PRIMARY mode.
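The startup decision described above can be sketched as a lookup over the SERVERROLE values stored in the BEFailOver table. The class RoleDecisionSketch and the list passed to it are hypothetical stand-ins for the database query, and the exact text stored in SERVERROLE is assumed here to be "PRIMARY" in any letter case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the startup role decision; not the Web NMS implementation.
class RoleDecisionSketch {
    enum Role { PRIMARY, STANDBY }

    // Returns the role a starting BE server would take, given the SERVERROLE
    // values currently present in the BEFailOver table.
    static Role decide(List<String> serverRolesInTable) {
        for (String role : serverRolesInTable) {
            if ("PRIMARY".equalsIgnoreCase(role)) {
                return Role.STANDBY; // a primary is already registered
            }
        }
        return Role.PRIMARY; // no primary found, so this server becomes primary
    }

    public static void main(String[] args) {
        List<String> table = new ArrayList<>();
        System.out.println(decide(table)); // first server started: PRIMARY
        table.add("PRIMARY");
        System.out.println(decide(table)); // second server started: STANDBY
    }
}
```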

 

 

Notification Mechanism

 

In the failover process, the failure of the primary server and the subsequent takeover by the standby are the two most important events, and all the modules need to be notified of them. The notification mechanism achieves this.

 

When the primary server fails, the Standalone FE detects the communication break and immediately sends a notification about the failure to all the modules that have registered to receive it. This is known as preBEFailOverNotification. Similarly, it sends another notification soon after the standby server has successfully taken over and all the connections have been restored. This is known as postBEFailOverNotification.

 

For further information about the BEFailOverListener interface, please refer to the javadocs.

 

How to Register as Failover Listener ?

 

The modules that need to receive preBEFailOverNotification and postBEFailOverNotification should register themselves as failover listeners. To register as listeners, the modules should implement the com.adventnet.nms.fe.common.BEFailOverListener interface. The following code shows how a module can register itself as a failover listener for receiving notifications.

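    public class TestAPIProxyImpl implements BEFailOverListener
    {
        public TestAPIProxyImpl()
        {
            // Register this module to receive failover notifications
            PureServerUtilsFE.clientSocketFE.registerBEFailOverListener(this);
        }

        // Called when the FE detects that the primary server has failed
        public void preBEFailOverNotification(BEFailOverEvent event)
        {
            .......
        }

        // Called after the standby has taken over and connections are restored
        public void postBEFailOverNotification(BEFailOverEvent event)
        {
            .......
        }
    }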
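To show what a module might do between the two notifications, here is a self-contained sketch that holds back requests while a failover is in progress and releases them afterwards. Everything in it is hypothetical: FailOverCallbacks is a local stand-in that only mirrors the two callback names of BEFailOverListener (without the BEFailOverEvent argument), and BufferingModuleSketch is an illustrative module, not part of Web NMS.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Local stand-in that mirrors the two callbacks of BEFailOverListener.
interface FailOverCallbacks {
    void preBEFailOverNotification();
    void postBEFailOverNotification();
}

// Hypothetical module that buffers requests while the standby takes over.
class BufferingModuleSketch implements FailOverCallbacks {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> sent = new ArrayList<>();
    private boolean failoverInProgress = false;

    @Override
    public synchronized void preBEFailOverNotification() {
        failoverInProgress = true; // stop sending until the takeover completes
    }

    @Override
    public synchronized void postBEFailOverNotification() {
        failoverInProgress = false;
        while (!pending.isEmpty()) {
            sent.add(pending.poll()); // release buffered requests in order
        }
    }

    synchronized void submit(String request) {
        if (failoverInProgress) {
            pending.add(request);
        } else {
            sent.add(request);
        }
    }

    synchronized List<String> sent() {
        return new ArrayList<>(sent);
    }

    public static void main(String[] args) {
        BufferingModuleSketch module = new BufferingModuleSketch();
        module.submit("request-1");
        module.preBEFailOverNotification();
        module.submit("request-2"); // buffered during the failover
        module.postBEFailOverNotification();
        System.out.println(module.sent());
    }
}
```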

Example

 

A Reference Implementation explaining how to setup FailoverListener is available in the examples section.

 

How to De-register ?

 

A module can also stop receiving failover notifications by de-registering itself. The following code snippet shows how this can be achieved.

    // Removes the failover listener registered for this module
    PureServerUtilsFE.clientSocketFE.deRegisterBEFailOverListener(this);

Role of FailOver.xml

 

The <Web NMS Home>/conf/FailOver.xml contains user inputs to be passed to the failover framework. Users can use this FailOver.xml file to pass their inputs to the Primary server and the Standby server. The general structure of the FailOver.xml is as below.

 

    <FAILOVER>
        <PRIMARY HEART_BEAT_INTERVAL="60" />
        <STANDBY
            FAIL_OVER_INTERVAL="60"
            RETRY_COUNT="1">
            <BACKUP
                ENABLED="TRUE"
                BACKUP_INTERVAL="600">
                <INCLUDE>
                    <DIR NAME="myDir1"/>
                    <FILE NAME="myFile1"/>
                </INCLUDE>
                <EXCLUDE>
                    <DIR NAME="myDir2"/>
                    <FILE NAME="myFile2"/>
                </EXCLUDE>
            </BACKUP>
            <SEND_EMAIL
                SMTP_SERVER="mail-server1"
                TO_ADDRESS="xyz@webnms.com"
                FROM_ADDRESS="webnms@webnms.com"
                SUBJECT="Web NMS Primary Server Failed"
                BODY="The Web NMS Back End Server is failed and taken over by the Hot Stand By Server"/>
        </STANDBY>
    </FAILOVER>
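For readers who want to inspect these values programmatically, the sketch below reads the interval attributes from a FailOver.xml-style document with the standard JDK DOM parser. It is an illustration only: FailOverXmlSketch and its methods are hypothetical helpers, and Web NMS itself may read the file differently.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper for reading attributes from FailOver.xml content.
class FailOverXmlSketch {
    // Parses XML text into a DOM document; wraps parser errors as unchecked.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Invalid FailOver.xml content", e);
        }
    }

    // Returns an integer attribute of the first element with the given tag name.
    static int intAttribute(Document doc, String tag, String attribute) {
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        return Integer.parseInt(element.getAttribute(attribute));
    }

    public static void main(String[] args) {
        String xml = "<FAILOVER>"
                + "<PRIMARY HEART_BEAT_INTERVAL=\"60\"/>"
                + "<STANDBY FAIL_OVER_INTERVAL=\"60\" RETRY_COUNT=\"1\"/>"
                + "</FAILOVER>";
        Document doc = parse(xml);
        System.out.println("HEART_BEAT_INTERVAL = " + intAttribute(doc, "PRIMARY", "HEART_BEAT_INTERVAL"));
        System.out.println("FAIL_OVER_INTERVAL  = " + intAttribute(doc, "STANDBY", "FAIL_OVER_INTERVAL"));
        System.out.println("RETRY_COUNT         = " + intAttribute(doc, "STANDBY", "RETRY_COUNT"));
    }
}
```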

Note: Until Web NMS 4.7.0 SP1, FailOver.xml will contain either the PRIMARY entries or the STANDBY entries, not both. If the FailOver.xml contains both the primary and standby entries then the server will not be started. Hence when the role of the server changes from PRIMARY to STANDBY or STANDBY to PRIMARY, FailOver.xml has to be changed manually to pass the inputs required for the current mode / role of the server.

From Web NMS 4.7.0 SP2 onwards, FailOver.xml can contain entries for both primary and standby servers. Inputs relevant to the current mode of the server will be used by the failover framework. Hence the user does not need to modify the FailOver.xml manually, whenever the role of the server changes.

 

To enhance the backup operations done during the failover, new entries have been introduced in the FailOver.xml, which are explained in the Configuration Options section. The new entries introduced in Web NMS 4.7.0 SP2 are the <INCLUDE>, <EXCLUDE>, <DIR>, and <FILE> tags in the FailOver.xml entries given above.

 

 

 

Configuration Options

 

For Primary Server

 

    Parameter: HEART_BEAT_INTERVAL (in seconds)
    Description: The periodic time interval at which the primary server updates the LASTCOUNT in the BEFailOver table in the database.
    Default value: 60 seconds

    Parameter: FAIL_OVER_INTERVAL (in seconds)
    Description: The periodic time interval at which the primary monitors the standby's LASTCOUNT.
    Default value: 60 seconds

For Standby Server


    Parameter: FAIL_OVER_INTERVAL (in seconds)
    Description: The periodic time interval at which the standby server monitors the primary's LASTCOUNT.
    Default value: 60 seconds

    Parameter: RETRY_COUNT
    Description: The number of times the standby has to check the primary's LASTCOUNT (when the primary fails to update the LASTCOUNT) before assuming that the primary has failed.
    Default value: 1

    Parameter: HEART_BEAT_INTERVAL (in seconds)
    Description: The periodic time interval at which the standby server updates the LASTCOUNT in the BEFailOver table in the database.
    Default value: 30 seconds

This Failover setup provides options for taking a backup of the configuration files when failover occurs and for sending e-mails about the occurrence of failover. You may configure these through the following parameters in the FailOver.xml file.

Note: It is recommended to keep the HEART_BEAT_INTERVAL of the PRIMARY server and the FAIL_OVER_INTERVAL of the STANDBY server above 20 seconds. Lower values may lead to unexpected failover.

Configuring Backup

 

The following parameters provide the option of taking a backup of the configuration files from the primary server and carrying them over to the standby, so that the configuration files on both servers stay in sync.

    Parameter: ENABLED (true/false)
    Description: If set to "true", the configuration files are backed up. If both the primary and the standby servers share the same conf files mounted on a common system, this parameter can be set to "false".
    Default value: true

    Parameter: BACKUP_INTERVAL (in seconds)
    Description: The periodic time interval at which the configuration files are backed up.
    Default value: 600

Including or Excluding Directories During the Backup Process

 

From Web NMS 4.7.0 SP2 onwards, users can configure which directories are backed up during failover. The <INCLUDE> tag in the FailOver.xml lists the directories that should be backed up, and the <EXCLUDE> tag lists the directories that should not be backed up. By combining the INCLUDE and EXCLUDE tags, users can include a parent directory and filter out the unwanted directories and files under it. Directories and files are specified with the <DIR> and <FILE> tags. These tags are explained below.

    Tag: DIR
    Description: Specifies a directory to be included or excluded. The name of the directory should be given in the NAME attribute of this tag, relative to the Web NMS Home.

    Tag: FILE
    Description: Specifies a file to be included or excluded. The name of the file should be given in the NAME attribute of this tag, relative to the Web NMS Home.

    Tag: INCLUDE
    Description: The INCLUDE tag contains DIR and FILE tags. The directories and files specified inside this tag are backed up during the failover process.

    Tag: EXCLUDE
    Description: The EXCLUDE tag contains DIR and FILE tags. The directories and files specified inside this tag are not backed up during the failover process.
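The way INCLUDE and EXCLUDE combine can be sketched with a small path filter. This is an illustration under stated assumptions, not the Web NMS backup code: BackupFilterSketch is a hypothetical class, DIR and FILE entries are both treated as paths relative to the Web NMS Home, and an EXCLUDE entry is assumed to override an INCLUDE entry that covers the same path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter that combines INCLUDE and EXCLUDE entries.
class BackupFilterSketch {
    private final List<Path> included;
    private final List<Path> excluded;

    BackupFilterSketch(List<String> include, List<String> exclude) {
        this.included = toPaths(include);
        this.excluded = toPaths(exclude);
    }

    private static List<Path> toPaths(List<String> names) {
        List<Path> paths = new ArrayList<>();
        for (String name : names) {
            paths.add(Paths.get(name).normalize());
        }
        return paths;
    }

    // A candidate matches an entry if it is that entry or lies beneath it.
    private static boolean matches(List<Path> entries, Path candidate) {
        for (Path entry : entries) {
            if (candidate.startsWith(entry)) {
                return true;
            }
        }
        return false;
    }

    // Backed up only if covered by an INCLUDE entry and by no EXCLUDE entry.
    boolean shouldBackUp(String relativePath) {
        Path candidate = Paths.get(relativePath).normalize();
        return matches(included, candidate) && !matches(excluded, candidate);
    }

    public static void main(String[] args) {
        BackupFilterSketch filter = new BackupFilterSketch(
                Arrays.asList("myDir1", "myFile1"),
                Arrays.asList("myDir1/myDir2", "myDir1/myFile2"));
        System.out.println(filter.shouldBackUp("myDir1/a.xml"));       // true
        System.out.println(filter.shouldBackUp("myDir1/myDir2/b.xml")); // false
        System.out.println(filter.shouldBackUp("other/c.xml"));         // false
    }
}
```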

Configuring E-mail Action

 

The following parameters provide the option of sending e-mails about the failover process to any recipient.

    Parameter: SMTP_SERVER
    Description: The SMTP mail server address.
    Default value: ---

    Parameter: FROM_ADDRESS
    Description: Sender's e-mail address.
    Default value: ---

    Parameter: SUBJECT
    Description: Subject of the mail.
    Default value: ---

    Parameter: BODY
    Description: The message which you wish to convey.
    Default value: ---

Note: There is no constraint on the values of HEART_BEAT_INTERVAL or FAIL_OVER_INTERVAL. However, it is always advisable to give FAIL_OVER_INTERVAL a greater value than HEART_BEAT_INTERVAL. If the two values are equal and the primary is delayed in updating the LASTCOUNT, the standby, upon finding that the LASTCOUNT has not been updated, would presume that the primary had failed and would try to assume the role of PRIMARY.
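The two pieces of advice on interval values (the recommended 20-second minimum and a FAIL_OVER_INTERVAL greater than the HEART_BEAT_INTERVAL) can be checked with a small helper such as the one below. IntervalCheckSketch is a hypothetical utility written for this page; Web NMS does not enforce these checks itself, as the note says there is no constraint on the values.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical checker for the interval recommendations given in this section.
class IntervalCheckSketch {
    // Recommended lower bound, in seconds, for both intervals.
    static final int RECOMMENDED_MINIMUM_SECONDS = 20;

    // Returns one warning per recommendation that the given values do not meet.
    static List<String> warnings(int heartBeatInterval, int failOverInterval) {
        List<String> result = new ArrayList<>();
        if (heartBeatInterval <= RECOMMENDED_MINIMUM_SECONDS) {
            result.add("HEART_BEAT_INTERVAL should be above 20 seconds");
        }
        if (failOverInterval <= RECOMMENDED_MINIMUM_SECONDS) {
            result.add("FAIL_OVER_INTERVAL should be above 20 seconds");
        }
        if (failOverInterval <= heartBeatInterval) {
            result.add("FAIL_OVER_INTERVAL should be greater than HEART_BEAT_INTERVAL");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(warnings(30, 50)); // meets both recommendations
        System.out.println(warnings(60, 60)); // equal values trigger one warning
    }
}
```

With both intervals set to 60 seconds, the sketch returns a single warning, because the two values are equal.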

 

How to use Virtual IP in the Failover Setup

 

In the current WebNMS failover functionality, the primary and standby servers use two separate IP addresses. The monitored devices send traps to both IP addresses to ensure that all traps are received and processed properly. This increases the load and the network traffic. High availability using a virtual IP address eliminates these disadvantages. In this setup, whenever the primary server starts, it assigns the configured virtual IP address to its machine's MAC address. When the primary fails, the standby server reassigns the virtual IP address to its own machine's MAC address. Whenever the virtual IP address is reassigned, a gratuitous ARP is sent to announce the change in MAC mapping. As a result, the devices need to communicate with the virtual IP address alone.

    Note: This high availability feature relies on system commands, which must be installed on the server machines before the feature is used. The following system commands are used:

    ifconfig : net-tools package
    ping     : iputils package
    arpsend  : http://www.net.princeton.edu/software/arpsend

 

Configurations :

 

By default, this feature is disabled in WebNMS. To enable it, add the entry below to the <WebNMS Home>/conf/serverparameters.xml file.

 

    #VIP Configurations
    ENABLE_VIRTUAL_IP_CREATION true
    #VIP_IMPLEMENTATION com.webnms.nms.ha.MyVIPImplementation

 

The default implementation of HighAvailabilityInf is used when ENABLE_VIRTUAL_IP_CREATION is set to true and VIP_IMPLEMENTATION is not provided. You can write your own class by implementing the HighAvailabilityInf interface and provide its class name against the VIP_IMPLEMENTATION parameter.

 

The various parameter configurations needed for creating virtual ipAddress must be done in the <WebNMS Home>/conf/VIPConfiguration.xml file. The following parameters must be configured.

 

    Parameter: Interfaces
    Description: The interface to which the virtual IP address is to be bound.

    Parameter: Virtual IP
    Description: The virtual IP address that is to be assigned to the interface.

    Parameter: Netmask
    Description: The netmask of the virtual IP address. If it is not specified, the default netmask 255.255.255.0 is used.

    Parameter: ARP Command
    Description: Optional parameter for specifying a custom command for sending the gratuitous ARP. If it is not specified, the default ARP command using arping is used.

    Parameter: Retries
    Description: The number of retries done by the server health monitor before declaring the server dead.

    Parameter: Heart Beat Interval
    Description: The time interval between successive retries, in seconds.

    Parameter: Class Name
    Description: Optional parameter for specifying a custom server health monitor implementation. If it is not specified, the default health monitor implementation is used. The purpose of this health monitor is to delete the virtual IP address in case of an improper shutdown of the primary server.
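As an illustration of how the Interfaces, Virtual IP, and Netmask parameters could translate into an ifconfig invocation, the sketch below builds the command arguments without executing them. The class VipCommandSketch is hypothetical, and whether WebNMS issues exactly these commands is an assumption; the argument order follows the usual net-tools form for bringing an aliased interface (for example eth0:1) up or down.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical builder for the ifconfig commands that bind or release a virtual IP.
class VipCommandSketch {
    // Default netmask stated in the parameter description above.
    static final String DEFAULT_NETMASK = "255.255.255.0";

    // Arguments for assigning the virtual IP to an interface alias.
    static List<String> assign(String iface, String virtualIp, String netmask) {
        String mask = (netmask == null || netmask.isEmpty()) ? DEFAULT_NETMASK : netmask;
        return Arrays.asList("ifconfig", iface, virtualIp, "netmask", mask, "up");
    }

    // Arguments for removing the virtual IP by bringing the alias down.
    static List<String> release(String iface) {
        return Arrays.asList("ifconfig", iface, "down");
    }

    public static void main(String[] args) {
        // The interface alias and address below are sample values only.
        System.out.println(String.join(" ", assign("eth0:1", "192.168.1.50", null)));
        System.out.println(String.join(" ", release("eth0:1")));
    }
}
```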

 

Limitations

 

Configuration using Web NMS Eclipse Plugin

 

Import the file FailOver.xml from <Web NMS Home>/conf directory into the Eclipse Project. Open the file in the project. Edit the file and click Save to save the changes.

 

Refer to the section Working with Files in the Eclipse Guide for more details on how to import WebNMS configuration files into Eclipse.

 

 

How to disable BE Failover ?

 

If you do not need the BE FailOver mechanism and wish to disable it, you can do so as follows:

 

In the startnms.bat/sh file under the <Web NMS Home>/bin directory, add the following entry as a command line argument in the call to NmsMainBE.

 

NMS_BE_FAILOVER false

 

Hot Standby Fault Failover

 

The standby server of Web NMS offers warm standby support. Any operation or request in the network during the intervening period (i.e., the time between the failure of the primary and the complete takeover by the standby) will be lost. During this failover period, critical traps and notifications sent from the agents could also be missed. To avoid losing notifications, you can enable the hot standby fault failover in the standby server.

 

For more information on its working mechanism and setup procedure, refer to the Hot Standby Fault Failover README located in the <Web NMS Home>/default_impl/failover directory.

 


 


Copyright © 2013, ZOHO Corp. All Rights Reserved.