Web NMS has two BE servers:
Primary BE
Standby BE
The Primary and Standby BE servers are redundant configurations designed to serve the same functionality. They both have access to the same database. When the primary server fails or is brought down, the standby server takes over the functions that were being performed by the primary. The primary server may be brought down for scheduled maintenance.
The standby server of Web NMS offers warm standby support. Any operation or request in the Network during the intervening period (i.e., the time period between the failure of primary and the subsequent complete take over by standby) will be lost.
What does the Primary Server do?
In a failover setup, the primary and standby servers must have access to the same database. In the database, details regarding the primary and standby servers are maintained in a table named BEFailOver. At a regular, configurable interval known as HEART_BEAT_INTERVAL, the primary server updates the BEFailOver table to signal its presence by incrementing a counter known as LASTCOUNT. For example, if you specify 30 seconds as the value of HEART_BEAT_INTERVAL, the primary updates the BEFailOver table every 30 seconds.
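The heartbeat described above can be sketched as follows. This is only an illustration, not the Web NMS implementation: `HeartbeatRow` is a hypothetical in-memory stand-in for the primary's LASTCOUNT column in the BEFailOver table, and `HeartbeatDemo` is an invented name. The demo uses a 1-second interval purely to keep it short.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the LASTCOUNT column of the primary's BEFailOver row. */
final class HeartbeatRow {
    private final AtomicLong lastCount = new AtomicLong();

    /** One heartbeat: increment LASTCOUNT and return the new value. */
    long beat() {
        return lastCount.incrementAndGet();
    }

    long lastCount() {
        return lastCount.get();
    }
}

final class HeartbeatDemo {
    /** Schedules one beat per interval; the caller is responsible for shutting the scheduler down. */
    static ScheduledExecutorService start(HeartbeatRow row, long heartBeatIntervalSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(row::beat, heartBeatIntervalSeconds,
                heartBeatIntervalSeconds, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        HeartbeatRow row = new HeartbeatRow();
        ScheduledExecutorService scheduler = start(row, 1); // 1 s only to keep the demo short
        Thread.sleep(3500);
        scheduler.shutdownNow();
        System.out.println("LASTCOUNT after about 3.5 s: " + row.lastCount());
    }
}
```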
What does the Standby Server do?
When the primary server is running, if you start the other server (standby), it tries to register with the primary server. If no standby server is already registered, the primary server registers this as the standby server. At any time, only one primary server and one standby server can be configured. If you try to start a second standby server, the primary server will refuse registration. When the primary server registers a standby server, it makes an entry regarding the registration in the database.
Similar to the primary, the standby server signals its presence by incrementing its own LASTCOUNT in the BEFailOver table at every HEART_BEAT_INTERVAL. The primary server monitors the standby's LASTCOUNT at the specified time interval. When the standby fails to update the LASTCOUNT, the primary assumes that the standby has failed and removes its registration along with its entries in the BEFailOver table. A new standby server can then be connected.
What happens when the Primary Server fails?
When the primary server fails, it stops updating the LASTCOUNT. The standby server monitors the primary's LASTCOUNT at a periodic interval known as FAIL_OVER_INTERVAL. For example, if you have specified FAIL_OVER_INTERVAL as 50 seconds, the standby checks the primary's LASTCOUNT every 50 seconds. Each time the standby server looks up the LASTCOUNT, it compares the previous and present counts. When the primary server has failed to update the LASTCOUNT, consecutive counts are the same. The RETRY_COUNT parameter specifies the number of times the standby has to check the primary's unchanged LASTCOUNT before assuming that the primary has failed.
Once it finds that the primary has failed, the standby immediately changes its mode to PRIMARY and assumes all the functions that were being performed by the previously active server. The Front End Server then switches its connection over to the standby server. Clients connected to the Web NMS front end servers continue to receive service without any knowledge of the failover, except during the brief failover interval itself, when they may get some timeouts.
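The standby's check can be illustrated with a small sketch. The class and method names are hypothetical. The exact RETRY_COUNT semantics are an assumption too: this sketch reports failure once LASTCOUNT has stayed unchanged for RETRY_COUNT consecutive checks, which may differ in detail from what Web NMS does.

```java
/** Sketch of the standby's periodic check; all names here are hypothetical. */
final class PrimaryFailureDetector {
    private final int retryCount;
    private long previousCount;
    private int unchangedChecks;

    PrimaryFailureDetector(int retryCount, long initialCount) {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be at least 1");
        }
        this.retryCount = retryCount;
        this.previousCount = initialCount;
    }

    /**
     * Intended to be called once per FAIL_OVER_INTERVAL with the LASTCOUNT just read.
     * Returns true when the count has stayed unchanged for retryCount consecutive checks.
     */
    boolean check(long currentCount) {
        if (currentCount != previousCount) {
            // The primary updated its count, so it is alive: reset the miss counter.
            previousCount = currentCount;
            unchangedChecks = 0;
            return false;
        }
        unchangedChecks++;
        return unchangedChecks >= retryCount;
    }
}
```

Under this reading, the standby takes roughly FAIL_OVER_INTERVAL multiplied by RETRY_COUNT to conclude that the primary has failed, measured from the first check that sees no change.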
The following timeline diagram depicts the complete flow of events in the failover process.

Note: To set up BE failover successfully, both the primary and the standby server machines must be DNS enabled.
Assume that you have BE servers installed on two machines, Machine 1 and Machine 2, and you want to start the BE server on Machine 1 as primary and the BE server on Machine 2 as standby. To do so, follow these steps:
Start the BE server in Machine 1.
Copy the hibernate.cfg.xml file of the BE server on Machine 1 into the <Web NMS Home>/classes/hbnlib directory of the BE server on Machine 2, replacing the existing file.
In the copied file, change the DB host name entry from localhost to the host name of the machine where the DB server is running (needed when BE failover is set up on a different machine).
Start the BE server in Machine 2.
The above steps set up the BE failover environment. When you start the primary server, an entry is made in the BEFailOver table with the following information about the primary server: HOSTADDRESS, NMSBEPORT, RMIREGISTRYPORT, LASTCOUNT, SERVERROLE, and STANDBYSERVERNAME. When another BE server is started with the same database parameters (i.e., with the same hibernate.cfg.xml file) as the primary server, that server checks whether the BEFailOver table has any entry with SERVERROLE as primary. If such an entry is found, it registers itself with the primary server as a standby server.
Now, the primary server will run in Machine 1 and the standby server will run in Machine 2. When the primary server fails, the standby server will take over the tasks of the primary server automatically.
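The role selection described above can be sketched as a small decision function. This is an illustration only: `ServerRoleDecision` is a hypothetical name, and the collection passed in stands for the SERVERROLE values currently found in the BEFailOver table. The refusal of a second standby follows the rule that only one primary and one standby can be configured at a time.

```java
import java.util.Collection;

/** Hypothetical sketch of how a starting BE server could pick its role. */
final class ServerRoleDecision {
    enum Role { PRIMARY, STANDBY }

    /**
     * registeredRoles stands for the SERVERROLE values already present in BEFailOver.
     * No primary yet: start as PRIMARY. Primary only: register as STANDBY.
     * Primary and standby both present: registration is refused.
     */
    static Role roleForNewServer(Collection<Role> registeredRoles) {
        if (!registeredRoles.contains(Role.PRIMARY)) {
            return Role.PRIMARY;
        }
        if (registeredRoles.contains(Role.STANDBY)) {
            throw new IllegalStateException("A standby server is already registered");
        }
        return Role.STANDBY;
    }
}
```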
Note: Until Web NMS 4.7.0 SP1, users must explicitly specify the mode (PRIMARY or STANDBY) in which the BE server should be started in <Web NMS Home>/conf/FailOver.xml.
From Web NMS 4.7.0 SP2 onwards, when a BE server is started, the Web NMS framework reads the database directly (instead of reading <Web NMS Home>/conf/FailOver.xml) to find out whether a PRIMARY server is already running. If so, the current BE server is started in STANDBY mode; otherwise, it is started in PRIMARY mode.
In the failover process, the primary server failure and the subsequent takeover by the standby are the two key events, and all the modules must be notified of them. The notification mechanism achieves this.
When the primary server fails, the Standalone FE detects the communication break and immediately sends a notification regarding the failure to all the modules which have registered for receiving such a notification. This is known as preBEFailOverNotification. Similarly, it sends another notification soon after the standby server has successfully taken over and all the connections have been restored. This is known as postBEFailOverNotification.
For further information about the BEFailOverListener interface, please refer to the javadocs.
How to Register as a Failover Listener?
The modules which wish to receive preBEFailOverNotification and postBEFailOverNotification should register themselves as FailOverListeners. To register as listeners, the modules should implement the BEFailOverListener interface of the com.adventnet.nms.fe.common package. The following code shows how a module can register itself as a failover listener for receiving notifications.
    public class TestAPIProxyImpl implements BEFailOverListener {

        public TestAPIProxyImpl() {
            // Register this module to receive failover notifications
            PureServerUtilsFE.clientSocketFE.registerBEFailOverListener(this);
        }

        public void preBEFailOverNotification(BEFailOverEvent event) {
            .......
        }

        .......
    }
Example
A Reference Implementation explaining how to set up a FailoverListener is available in the examples section.
How to De-register?
A module that no longer needs to receive failover notifications can de-register itself. The following code snippet shows how this can be achieved.
    // Removes FailoverListener for a module
    PureServerUtilsFE.clientSocketFE.deRegisterBEFailOverListener(this);
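The register, notify, and de-register cycle can be tried outside Web NMS with a self-contained sketch. None of the types below are Web NMS classes: `SketchFailOverListener`, `SketchListenerRegistry`, and `RecordingListener` are hypothetical stand-ins, and a plain `String` takes the place of the event object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Stand-in for the listener interface; this is not the Web NMS type. */
interface SketchFailOverListener {
    void preBEFailOverNotification(String event);
    void postBEFailOverNotification(String event);
}

/** Hypothetical registry showing the register / notify / de-register cycle. */
final class SketchListenerRegistry {
    // CopyOnWriteArrayList lets listeners de-register safely while a notification is in progress.
    private final List<SketchFailOverListener> listeners = new CopyOnWriteArrayList<>();

    void register(SketchFailOverListener listener) {
        listeners.add(listener);
    }

    void deRegister(SketchFailOverListener listener) {
        listeners.remove(listener);
    }

    /** Sent when the communication break with the primary is detected. */
    void firePre(String event) {
        for (SketchFailOverListener listener : listeners) {
            listener.preBEFailOverNotification(event);
        }
    }

    /** Sent after the standby has taken over and connections are restored. */
    void firePost(String event) {
        for (SketchFailOverListener listener : listeners) {
            listener.postBEFailOverNotification(event);
        }
    }
}

/** Listener that simply records what it was told. */
final class RecordingListener implements SketchFailOverListener {
    final List<String> received = new ArrayList<>();

    @Override
    public void preBEFailOverNotification(String event) {
        received.add("pre:" + event);
    }

    @Override
    public void postBEFailOverNotification(String event) {
        received.add("post:" + event);
    }
}
```

A module would typically use the first notification to pause work that depends on the BE server and the second to resume it.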
The <Web NMS Home>/conf/FailOver.xml contains user inputs to be passed to the failover framework. Users can use this FailOver.xml file to pass their inputs to the Primary server and the Standby server. The general structure of the FailOver.xml is as below.
    <FAILOVER>
        <PRIMARY HEART_BEAT_INTERVAL="60" />
        <STANDBY FAIL_OVER_INTERVAL="60" RETRY_COUNT="1">
            <BACKUP ENABLED="TRUE" BACKUP_INTERVAL="600" >
                <INCLUDE>
                    <DIR NAME="myDir1"/>
                    <FILE NAME="myFile1"/>
                </INCLUDE>
                <EXCLUDE>
                    <DIR NAME="myDir2"/>
                    <FILE NAME="myFile2"/>
                </EXCLUDE>
            </BACKUP>
            <SEND_EMAIL SMTP_SERVER="mail-server1"
                        TO_ADDRESS="xyz@webnms.com"
                        FROM_ADDRESS="webnms@webnms.com"
                        SUBJECT="Web NMS Primary Server Failed"
                        BODY="The Web NMS Back End Server is failed and taken over by the Hot Stand By Server"/>
        </STANDBY>
    </FAILOVER>
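To inspect such a file outside the server, the standard Java DOM parser is sufficient. The sketch below is not how Web NMS itself reads FailOver.xml; `FailOverXmlReader` and its methods are hypothetical helpers that pull out the interval settings from a document with the structure shown above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper for reading numeric settings out of a FailOver.xml document. */
final class FailOverXmlReader {

    /** Parses the XML text; parser errors are rethrown unchecked to keep callers simple. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse FailOver.xml content", e);
        }
    }

    /** Returns the named attribute of the first element with the given tag, as an int. */
    static int intAttribute(Document doc, String tag, String attribute) {
        Element element = (Element) doc.getElementsByTagName(tag).item(0);
        if (element == null) {
            throw new IllegalArgumentException("Missing element: " + tag);
        }
        return Integer.parseInt(element.getAttribute(attribute));
    }

    public static void main(String[] args) {
        String xml = "<FAILOVER>"
                + "<PRIMARY HEART_BEAT_INTERVAL=\"60\"/>"
                + "<STANDBY FAIL_OVER_INTERVAL=\"60\" RETRY_COUNT=\"1\"/>"
                + "</FAILOVER>";
        Document doc = parse(xml);
        System.out.println("HEART_BEAT_INTERVAL = " + intAttribute(doc, "PRIMARY", "HEART_BEAT_INTERVAL"));
        System.out.println("FAIL_OVER_INTERVAL  = " + intAttribute(doc, "STANDBY", "FAIL_OVER_INTERVAL"));
        System.out.println("RETRY_COUNT         = " + intAttribute(doc, "STANDBY", "RETRY_COUNT"));
    }
}
```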
Note: Until Web NMS 4.7.0 SP1, FailOver.xml can contain either the PRIMARY entries or the STANDBY entries, not both. If FailOver.xml contains both the primary and standby entries, the server will not start. Hence, when the role of the server changes from PRIMARY to STANDBY or from STANDBY to PRIMARY, FailOver.xml has to be changed manually to pass the inputs required for the current mode or role of the server.
To enhance the backup operations done during failover, new entries have been introduced in FailOver.xml, which are explained in the Configuration Options section. The new entries introduced in Web NMS 4.7.0 SP2 are in bold letters in the FailOver.xml entries given above.
For Primary Server
HEART_BEAT_INTERVAL is the only parameter to be configured for the primary server.
| Parameter | What does the parameter specify? | Default value |
|---|---|---|
| HEART_BEAT_INTERVAL (in seconds) | The periodic time interval at which the primary server updates the LASTCOUNT in the BEFailOver table in the database. | 60 seconds |
For Standby Server
| Parameter | What does the parameter specify? | Default value |
|---|---|---|
| FAIL_OVER_INTERVAL (in seconds) | The periodic time interval at which the standby server checks the primary's LASTCOUNT. | 60 seconds |
| RETRY_COUNT | The number of times the standby has to check the primary's LASTCOUNT (when the primary fails to update the LASTCOUNT) before assuming that the primary has failed. | 1 |
The failover setup provides options for taking a backup of the configuration files when failover occurs and for sending e-mails about the occurrence of failover. You may configure these through the following parameters in the FailOver.xml file.
Note: It is recommended to keep both the HEART_BEAT_INTERVAL of the PRIMARY server and the FAIL_OVER_INTERVAL of the STANDBY server above 20 seconds. Lower values may lead to unexpected failover.
Configuring Backup
The following parameters provide the option of backing up the configuration files from the primary server and carrying them over to the standby, so that the configuration files on both servers stay in sync.
| Parameter | What does the parameter specify? | Default value |
|---|---|---|
| ENABLED (true/false) | If set to "true", the configuration files are backed up. If both the primary and the standby servers share the same conf file mounted in a common system, this can be set to "false". | true |
| BACKUP_INTERVAL (in seconds) | The periodic time interval at which the configuration files are backed up. | 600 |
Including or Excluding directories during the backup process
From Web NMS 4.7.0 SP2 onwards, users can configure which directories are backed up during failover. The <INCLUDE> tag in FailOver.xml specifies the directories that should be backed up, and the <EXCLUDE> tag specifies the directories that should not be backed up. By combining the INCLUDE and EXCLUDE tags, users can include a parent directory and filter out the unwanted directories and files under it. Directories and files are specified using the <DIR> and <FILE> tags. These tags are explained below.
| Tag | What does the tag specify? |
|---|---|
| DIR | A directory to be included or excluded. The name of the directory should be given in the NAME attribute of this tag, relative to the Web NMS Home. |
| FILE | A file to be included or excluded. The name of the file should be given in the NAME attribute of this tag, relative to the Web NMS Home. |
| INCLUDE | Contains DIR and FILE tags. The directories specified inside this tag are backed up during the failover process. |
| EXCLUDE | Contains DIR and FILE tags. The directories specified inside this tag are not backed up during the failover process. |
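The way INCLUDE and EXCLUDE combine can be illustrated with a small path filter. This is a sketch under assumptions, not the Web NMS backup code: `BackupFilter` is a hypothetical class, all names are taken relative to the Web NMS Home, and it assumes that an EXCLUDE entry takes precedence over an INCLUDE entry that covers the same path.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter mirroring the INCLUDE / EXCLUDE entries of FailOver.xml. */
final class BackupFilter {
    private final List<Path> included;
    private final List<Path> excluded;

    /** Both lists hold DIR or FILE names relative to the Web NMS Home. */
    BackupFilter(List<String> includedNames, List<String> excludedNames) {
        this.included = toPaths(includedNames);
        this.excluded = toPaths(excludedNames);
    }

    /** True if the path lies under an included entry and under no excluded entry. */
    boolean shouldBackUp(String relativePath) {
        Path path = Paths.get(relativePath).normalize();
        return matchesAny(path, included) && !matchesAny(path, excluded);
    }

    private static List<Path> toPaths(List<String> names) {
        List<Path> paths = new ArrayList<>();
        for (String name : names) {
            paths.add(Paths.get(name).normalize());
        }
        return paths;
    }

    // Path.startsWith compares whole name elements, so "conf/tmp" does not match "conf/tmpfile".
    private static boolean matchesAny(Path path, List<Path> entries) {
        for (Path entry : entries) {
            if (path.startsWith(entry)) {
                return true;
            }
        }
        return false;
    }
}
```

With `conf` included and `conf/tmp` excluded, for example, this filter accepts `conf/FailOver.xml` and rejects everything under `conf/tmp`.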
Configuring E-mail Action
The following parameters give the option to send e-mails about the failover process.
| Parameter | What does the parameter specify? | Default value |
|---|---|---|
| SMTP_SERVER | The SMTP mail server address. | --- |
| FROM_ADDRESS | Sender's e-mail address. | --- |
| SUBJECT | Subject of the mail. | --- |
| BODY | The message which you wish to convey. | --- |
Note: There is no constraint on the values of HEART_BEAT_INTERVAL or FAIL_OVER_INTERVAL. However, it is always advisable to give FAIL_OVER_INTERVAL a greater value than HEART_BEAT_INTERVAL. If the two values are equal and the primary is delayed in updating the LASTCOUNT, the standby may find the LASTCOUNT unchanged, presume that the primary has failed, and try to assume the role of PRIMARY.
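The recommendations in the notes above can be turned into a simple pre-flight check. The sketch below is hypothetical (`FailOverIntervalCheck` is not part of Web NMS); it only encodes the two pieces of advice given in this document: keep both intervals above 20 seconds, and keep FAIL_OVER_INTERVAL greater than HEART_BEAT_INTERVAL.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-flight check for the two interval settings in FailOver.xml. */
final class FailOverIntervalCheck {
    /** Threshold taken from the recommendation to keep both intervals above 20 seconds. */
    static final int MIN_RECOMMENDED_SECONDS = 20;

    /** Returns one warning per recommendation that the given values do not follow. */
    static List<String> warnings(int heartBeatIntervalSeconds, int failOverIntervalSeconds) {
        List<String> warnings = new ArrayList<>();
        if (heartBeatIntervalSeconds <= MIN_RECOMMENDED_SECONDS) {
            warnings.add("HEART_BEAT_INTERVAL should be above 20 seconds");
        }
        if (failOverIntervalSeconds <= MIN_RECOMMENDED_SECONDS) {
            warnings.add("FAIL_OVER_INTERVAL should be above 20 seconds");
        }
        if (failOverIntervalSeconds <= heartBeatIntervalSeconds) {
            warnings.add("FAIL_OVER_INTERVAL should be greater than HEART_BEAT_INTERVAL");
        }
        return warnings;
    }
}
```

Note that the default values listed in the tables (60 seconds for both intervals) would trigger the last warning in this sketch, since the two values are equal.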
Configuration using Web NMS EclipsePlugin
Import the file FailOver.xml from <Web NMS Home>/conf directory into the Eclipse Project. Open the file in the project. Edit the file and click Save to save the changes.
Refer to the section Working with Files in the Eclipse Guide for more details on how to import Web NMS configuration files into Eclipse.
If you do not need the BE failover mechanism and wish to disable it, you can do so as follows:
In the startnms.bat/sh file present under <Web NMS Home>/bin directory, add the following entry as a command line argument while calling the NmsMainBE.
NMS_BE_FAILOVER false
The standby server of Web NMS offers warm standby support. Any operation or request in the network during the intervening period (i.e., the time between the failure of the primary and the subsequent complete takeover by the standby) will be lost. During this failover period, critical traps and notifications sent from the agents could also be missed. To avoid the loss of notifications, you can enable hot standby fault failover in the standby server.
For more information on its working mechanism and setup procedure, refer to the Hot Standby Fault Failover README located in the <Web NMS Home>/default_impl/failover directory.
I am getting the following error message. What went wrong? Warning: Cannot start Web NMS Server in Primary mode as specified in FailOver.xml file