Rocket UniData and UniVerse Replication best practices: monitoring
Part 2 of 3
This blog post offers best practices related to implementing and using Replication, as part of a High Availability/Disaster Recovery (HA/DR) strategy. In Part 2, I’ll focus on monitoring.
As a best practice, everyone should use the Exception Action Script to keep Replication in sync even during unplanned network interruptions, heartbeat timeouts, etc. The Exception Action Script is fired off automatically for any non-DBA ordered suspension of Replication.
The Script can email defined recipients in an organization to inform them of the suspension. It then goes through a sequence of checks to ensure Replication has been suspended successfully and waits two minutes before it attempts to correct the suspension. Emails inform the recipients of the success or failure at each stage. I originally developed the Exception Action Script to help a customer in the UK whose network administrators regularly took the network down for short intervals overnight, and I know you’ll find it as valuable as many other customers have. By default, the latest version of the tool keeps a log of the last 5 instances of when the Script ran; this number is configurable.
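To make that sequence concrete, here is a minimal Python sketch of the flow. It is not the shipped Script (which is a Unix shell script); the function names, the callbacks, and the message wording are all hypothetical, and the real checks are more involved.

```python
import time
from collections import deque


def make_run_log(max_entries=5):
    # Bounded log: once full, the oldest entry is dropped automatically.
    # The default of 5 mirrors the configurable default described above.
    return deque(maxlen=max_entries)


def handle_suspension(is_suspended, attempt_resync, notify,
                      sleep=time.sleep, wait_seconds=120.0):
    """Hypothetical flow: notify, confirm the suspension, wait, try to correct.

    is_suspended, attempt_resync and notify are caller-supplied callbacks
    (assumptions for this sketch, not real UniData/UniVerse APIs).
    Returns True if the correction attempt succeeded.
    """
    notify("Replication suspended")
    if not is_suspended():
        notify("Suspension could not be confirmed; no action taken")
        return False
    sleep(wait_seconds)  # two-minute pause before attempting the correction
    succeeded = attempt_resync()
    notify("Replication resumed" if succeeded else "Correction attempt failed")
    return succeeded
```

Passing the sleep function in as a parameter keeps the sketch easy to exercise without actually waiting two minutes.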
The worst-case scenario is when Replication becomes disabled because it has run out of log space and no one spotted the original suspension. To recover, a full refresh of the standby system is necessary. The Script identifies when Replication is suspended and attempts to address the suspension automatically (no manual intervention or examination necessary) before it adversely impacts users. The latest releases of UniData and UniVerse include an example Unix Script and will soon include an example Windows PowerShell version.
Exception Action Script Best Practices
You specify where the Exception Action Script is installed in the repsys config file. Our recommendation is to install the Exception Action Script in the same location on both machines (primary and standby). Deploying the Exception Action Script in the same location on both machines makes it easy to have a central copy of the repsys file for maintenance and distribution.
You can have different scripts run from different locations on different machines within Replication. This requires an understanding of which script is executed on each machine, which is not completely intuitive at first glance. The manuals explain this in greater detail, and you can also contact Support. Note that this setup will not allow you to have one standard copy of the repsys file.
Use the Monitoring Phantoms available from Support
In addition to the Exception Action Scripts, you can also use the Monitoring Phantoms. These Phantoms perform two functions:
- The Phantoms monitor the Replication writer error logs every 15 minutes by default and send an email notification of any unexpected error. Otherwise, it’s a manual process to go in and look at the error log files at regular intervals to make sure things are running as expected. To repeat from my first blog post, Replication is not an administration-free option.
- The Phantoms monitor the overall Replication status every 20 minutes and will send an email if an unexpected status is encountered or if progress on the standby machine has frozen. For example, the Phantom looks at how many updates have been applied on the standby and then looks at how many updates have been received and applied in the last 20 minutes.
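The "frozen standby" check in the second bullet can be sketched as a comparison of two samples taken one interval apart. The actual Phantoms are UniBasic programs; the Python below is only an illustration, and the exact rule (a backlog exists but the applied count has not moved) is my assumption about how such a check could work.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    received: int  # updates received by the standby so far
    applied: int   # updates applied on the standby so far


def standby_frozen(previous: Sample, current: Sample) -> bool:
    """Return True when updates are waiting but none were applied
    between the previous sample and the current one."""
    backlog = current.received - current.applied
    progressed = current.applied > previous.applied
    return backlog > 0 and not progressed
```

An idle system (nothing received, nothing pending) is deliberately not reported as frozen, since there is no backlog to apply.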
Both Monitoring Phantoms are UniBasic programs. The source code is on your machine once you install these Phantoms and you are welcome to improve/enhance the code with your own extra monitoring and conditions. Starting at UniVerse 12 and UniData 9, the full documentation for the example Exception Action Scripts and Monitoring Phantoms will be provided with the install. For now, both are available on request from Rocket Support.
Using the XAdmin Tool
The XAdmin interface makes it easy to understand what is going on with Replication as represented by the traffic light symbols. This can be supplied to customers or used within your organization.
The Replication Status indicates whether the publisher and subscriber are connected. The status can be one of the following:
- Green – The publisher and subscriber are connected for all groups involved in replication.
- Yellow – At least one of the replication groups has been suspended by an administrator.
- Red – At least one of the replication groups has been terminated abnormally.
The Sync Status indicates whether the subscribing (or standby) database is synchronized with the publishing (or primary) database. The status can be one of the following:
- Green – The publishing and subscribing databases are synchronized.
- Yellow – There are pending updates that have not been applied to the subscribing database.
The Connection Status between the publishing server and the subscribing server in the group can be one of the following:
- Green – The publisher and subscriber are connected in this group.
- Yellow – The replication group has been suspended by an administrator.
- Red – The replication group has terminated abnormally.
- Blue – The group is currently subject to Replication Pacing.
Replication Pacing gracefully slows down the UniVerse/UniData shells to reduce Replication Overflow and avoid Replication Disablement. Any non-zero value in the Total Pacing Weight field means the group has experienced pacing, even if the Connection Status is green.
I’d also like to point out the Replication disablement disk percentage (to the right of the traffic lights in the screen shot above). This shows you the space being used by the Replication logs. An easy way to check whether the standby is behind the primary is to look at the delta between the “subscriber committed” and “publisher committed” figures (right above the disablement disk percentage in the screen shot above). An LSN (logical sequence number) is generated for each Replication log (it’s given an incremental number). Each log contains one update.
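Because each log carries one update and LSNs increase by one per log, the delta between the two committed figures is the number of updates the standby still has to commit. A minimal sketch of that arithmetic follows; the function and parameter names are hypothetical, and the values would be read by hand from XAdmin or the reptool / uvreptool output.

```python
def replication_lag(publisher_committed: int, subscriber_committed: int) -> int:
    """Number of updates the subscriber is behind the publisher,
    assuming one update per log and sequential LSNs."""
    lag = publisher_committed - subscriber_committed
    if lag < 0:
        # The subscriber should never be ahead; treat it as bad input.
        raise ValueError("subscriber LSN is ahead of publisher LSN")
    return lag
```

For example, a publisher committed figure of 1500 and a subscriber committed figure of 1420 means the standby is 80 updates behind.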
The number one goal of tuning should be to reduce Replication Buffer Overflow. If we can keep everything in the buffer, Replication moves very quickly. Once the Buffers are overflowed, the results are disk writes to log files (which take longer than using a memory structure).
We have previous presentations on the types of Replication Buffer Overflow and Replication Pacing. If you’re interested in this material, please add a comment at the end of this post.
You can have more than one Replication Group per account and each is allocated its own set of resources. As a best practice, we recommend using a combination of one account level group per account and multiple file level groups for an account. This helps balance the load across groups which helps improve performance. By dividing the load across multiple groups, a single group won’t become overwhelmed. Also, some files may work best in their own group.
Remember you can turn on Data Compression, which compresses the traffic from the primary machine to the standby machine. Data compression can provide a big benefit when the two machines are physically a long distance apart and where the network interface between them may be slow.
Field Level Updates and an Intelligent Queue Manager for repeated updates are available now in UniData 8.2 and soon in UniVerse 12. Instead of replicating the entire record, only the fields that have changed in the record are replicated. For example, if only 2 bytes of a 10k record have changed, that cuts down on the information going from the primary machine to the standby machine, which will be quicker (since we’re using less memory and reducing the chance of overflowing the buffers where we’d have to log updates to disk). The Intelligent Queue Manager efficiently handles records that get updated over and over (200k-300k times during a batch run, for example). Writing only the last update for each record in the list removes the requirement to repeat all of those writes on the subscriber.
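The two ideas can be illustrated with a short Python sketch. This is not how the database implements them internally; modelling a record as a list of field strings and the queue as a list of (record ID, value) pairs are simplifying assumptions, and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def changed_fields(old: List[str], new: List[str]) -> Dict[int, str]:
    """Field-level delta: return {field_number: new_value} for every
    field that differs. Field numbers are 1-based; a missing field is
    treated as an empty string."""
    delta = {}
    for i in range(max(len(old), len(new))):
        before = old[i] if i < len(old) else ""
        after = new[i] if i < len(new) else ""
        if before != after:
            delta[i + 1] = after
    return delta


def coalesce(updates: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Repeated-update handling: keep only the last queued value
    for each record ID."""
    latest = {}
    for record_id, value in updates:
        latest[record_id] = value
    return latest
```

With a three-field record where only the second field changes, the delta holds a single entry instead of the whole record, and a queue with several updates to the same record ID collapses to one write per record.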
There are several “screen scraper” tools available that provide access to performance data, some in its raw form and some where the raw form has been converted. The raw form data comes from reptool / uvreptool utility.
- Replication Performance Monitor – you can run through XAdmin, through reptool / uvreptool or through the package of Basic programs deployed as part of the Replication Monitor Phantoms Suite.
- REPLOGGER consists of server-side Basic programs and a Java tool called the ‘U2 Replication Analyzer’; I’ll cover REPLOGGER in the final part of this 3-part series.
Replication Performance Monitor
The Replication Performance Monitor, included as part of the Monitoring Phantoms, collects information from the Replication system and writes the information into sequential .csv files. The information logged will help you understand how much data is being put through Replication and by which files. You can use this information to help with sizing the parameters for each Replication group. You can also use this information to better understand whether more groups are needed and how the files can be divided between those groups. Finally, you can use this information to help identify files that don’t need to be replicated.
When you install the Replication Monitor Phantom Suite, you can access the options to gather the performance statistics from either:
- The Windows client-side interface or
- The account on the server where the suite was installed
The instructions on how to use the client-side and server-side interfaces are fully documented within the deployment package. The package also includes the latest versions of the Exception Action Script and the Replication Configuration Checker. You have a choice of two interfaces:
- a client-side .NET GUI interface or
- you can run it on the server using a ‘green screen’ with a menu.
The Replication Monitor Phantom Suite also includes a Replication Configuration Checker program to scan the Replication configuration files and report back on common errors that can result in Replication not starting or performing badly; I developed this after talking to many customers and seeing a common need.
Replication Performance Monitor
There’s a client-side tool on .NET where you can do “start”, “stop”, and “gather” to get the performance results and you’ve got the same options to run using the menus on a green screen.
- Use ‘Start Performance Monitor’ to start the monitoring period
- Use ‘Stop Performance Monitor’ to stop the monitoring period
- Use ‘Gather Results’ to gather the results
- PLEASE NOTE: If ‘Start Performance Monitor’ is used again before ‘Gather Results’, the statistics from the last monitoring period will be lost.
Once you use ‘Gather Results’, monitoring will be stopped if it has not already been stopped. The tool gathers the results and places them into .csv files located in the ROCKET.BP file (where you installed the Monitoring Suite). From here you can easily load the .csv files into Excel. The following files are produced:
Object Update Report
The Object Update Report is very useful. Use it to configure and size Replication and the Replication Logs. To do this, you need to know how many files are updated, how many times they’re updated and the resulting size of the log file from each update. This report provides this information:
- the account or the group it’s associated with
- the name of the file
- the number of updates performed and
- the size of the log
As I said, this report includes all the information you need to configure Replication. You might also see files that don’t need to be replicated in this report.
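If you would rather script the analysis than use Excel, the report can be totalled per group in a few lines of Python. The column names used here (`group`, `file`, `updates`, `log_size`) are placeholders I have assumed for illustration; check the header row of the .csv your installation produces and adjust them accordingly.

```python
import csv
import io
from collections import defaultdict


def summarize_by_group(csv_text: str) -> dict:
    """Total the update count and log size per Replication group.

    Assumes a header row with 'group', 'updates' and 'log_size' columns
    (hypothetical names, not confirmed by the report documentation).
    """
    totals = defaultdict(lambda: {"updates": 0, "log_bytes": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        group = totals[row["group"]]
        group["updates"] += int(row["updates"])
        group["log_bytes"] += int(row["log_size"])
    return dict(totals)
```

Per-group totals like these give a starting point for sizing each group, and a file with a very high update count may be a candidate for its own file level group.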
I hope you found Part 2 useful. Look for more best practices in Part 3, where I’ll cover REPLOGGER, standard naming, the Replication Config Checker and producing a Recovery Blueprint. In the meantime, if you want to listen to the entire Replication Best Practices webinar, please feel free to listen and share within your organization.