Preparing for the Worst
At a previous job, one of our systems notified us that there had been a disk failure. I was not too concerned, as we had redundancy built into the disk array. In short order, however, several other disks failed, and the system came down hard. We would later find out that a manufacturing issue had caused so many disks to fail in such a short amount of time.
The good news was that we had a backup from the night before, and that this was a high-availability (HA) machine for our production system. Although reports were created on this partition, the outage didn’t cause too many issues for critical production processes. We called back the tapes from our offsite storage and began a full system restore. Once the system was restored, we allowed the replication software to catch up. The system was back up and running normally within 72 hours.
Had that been our production system, there would undoubtedly have been a lapse in service as we switched to our HA server, and some data loss from the gap between what the production system was processing at the time and what had been applied to the HA system’s database. These were risks that we, as a team and as a company, had planned for and acknowledged based on business need and on what we were able to budget for disaster recovery (DR). The point is that we had planned to be able to recover from a disaster. We had also tested, many times, the processes involved in executing such a recovery.
Often, when we think of a DR event, images of tornadoes, earthquakes, and plane crashes come to mind. We think those events won’t happen to our business. Yet there are many more threats than the catastrophic kinds we tend to conjure in our heads. Things like power outages, data or objects being deleted by mistake, or data center mishaps are much more common. (Ever had someone mistakenly IPL the system while installing a PTF?)
The harsh truth is that if you do business long enough, at some point you are likely to have an unplanned outage. If you are reading this and aren’t in a good place with your DR strategy, now is the time to fix that. I will be presenting an introduction to disaster recovery on IBM i at POWERUp 2020 in Atlanta, GA. I look forward to seeing you there!