Disaster Recovery

“The best-laid schemes o’ mice an’ men / Gang aft a-gley.” – Robert Burns. 

A cloud-era version might read: “The best-laid plans for an application on the cloud often go awry, and at the most unfortunate moment!”

Every system, given enough time, will eventually fail, even when it is designed and operated well. The cloud is no exception. You may have spent months designing a cloud-first architecture and deploying your application on it, yet a single lightning storm or a faulty undersea cable can knock the application completely offline and leave a large number of customers unhappy.

But this does not mean that we should shrug our shoulders and accept the status quo. Customer satisfaction depends on how quickly your deployment can recover from such a disaster and get your services up and running again. Hence, “Disaster Recovery” should not be an afterthought; it must have a place in the initial enterprise architecture strategy.

To understand this, let us consider a typical cloud infrastructure. It is divided into service regions, and each region contains multiple data centers (also known as zones). Each zone is a physical facility with its own infrastructure (compute, networking, storage, operations, etc.), power, internet connectivity and physical security. Each cloud provider has many such regions (typically 20+) around the world.

In the above scenario, disasters can happen at multiple levels:

  • At the application service level, i.e. an application crashes or a VM is knocked out
  • A managed service offered by the cloud provider (e.g. database, network) becomes unavailable
  • The entire availability zone is down
  • An entire region is out of service

It is highly unlikely that all the regions in a country will be down at the same time, and every region worldwide going down would mean only one thing: the earth is being hit by an asteroid! The rest of the article focuses on scenarios other than armageddon. The exception, of course, is a country that has only one region, where a single regional outage takes down the entire in-country deployment.

Let us consider an architecture that has high availability as its core theme. It combines three kinds of resources:

  • Zonal resources such as VMs which are running in data centers
  • Regional resources such as load balancers running across multiple zones
  • Global resources such as DNS services which run cross-region 

These services can be used to provide backup and handle disasters at multiple levels.

  • Every data center (or zone) will have multiple VMs running the same service. If a VM goes offline or crashes for some reason, traffic can be directed to the other VMs and new VMs can be launched using elasticity policies if required.
  • If the availability zone goes down, the load balancer can be used to direct traffic towards the other zones in the region, which are deployed for redundancy.
  • If the entire region happens to go down, traffic can be directed to application servers deployed in other regions within or even outside a country (if regulations allow).
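
The three fallback levels above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `VM`, `Zone` and `Region` classes, the `pick_target` function and all resource names are hypothetical and do not correspond to any cloud provider's API. A real load balancer would also spread traffic across every healthy VM rather than pick the first one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VM:
    name: str
    healthy: bool = True


@dataclass
class Zone:
    name: str
    vms: List[VM] = field(default_factory=list)
    available: bool = True


@dataclass
class Region:
    name: str
    zones: List[Zone] = field(default_factory=list)
    available: bool = True


def pick_target(regions: List[Region]) -> Optional[Tuple[str, str, str]]:
    """Return (region, zone, vm) for the first healthy VM, or None.

    Regions, zones and VMs are tried in list order, so the first region
    acts as the primary and later ones as fallbacks.
    """
    for region in regions:
        if not region.available:
            continue  # region-level outage: fall through to the next region
        for zone in region.zones:
            if not zone.available:
                continue  # zone-level outage: try the next zone
            for vm in zone.vms:
                if vm.healthy:
                    return region.name, zone.name, vm.name
    return None  # nothing is healthy anywhere


# Hypothetical deployment: a primary region with two zones and a backup region.
primary = Region("region-1", [
    Zone("zone-a", [VM("vm-a1"), VM("vm-a2")]),
    Zone("zone-b", [VM("vm-b1")]),
])
secondary = Region("region-2", [Zone("zone-c", [VM("vm-c1")])])
deployment = [primary, secondary]
```

Marking `vm-a1` as unhealthy moves traffic to `vm-a2`, marking `zone-a` as unavailable moves it to `zone-b`, and marking `region-1` as unavailable moves it to `region-2`, mirroring the three bullets above.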

You should note that when an application is replicated in other data centers, all tiers of the application have to be replicated. This includes any storage options or database services that are deployed as part of the architecture.

Each of these services has its own replication complexities that have to be worked out. For instance, say our application uses a database that supports replication. For a replica in another region to be usable after a failover, it has to stay in sync with the primary, and how that is achieved depends on the type of database being deployed. In a typical RDBMS setup, replicas have to be updated whenever the primary instance changes; doing this synchronously keeps them consistent but adds latency to every write. Many NoSQL databases, on the other hand, offer eventual consistency via asynchronous replication, where a replica may briefly lag behind the primary. You will need to evaluate synchronous replication against eventually consistent replication to figure out which option delivers the best results while ensuring that the replica in the other region does not drift too far out of sync. The key takeaway here is that disaster recovery options depend not just on infrastructural components but also on the choice of technologies.
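
To make the trade-off concrete, here is a minimal in-memory sketch of the two replication styles. It is a toy model, not how any particular database implements replication: the `Primary` and `Replica` classes and the `ship_pending` and `lag` methods are hypothetical names chosen for this illustration.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class Replica:
    """A copy of the data, for example one kept in another region."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value


class Primary:
    def __init__(self, replica: Replica, synchronous: bool) -> None:
        self.data: Dict[str, str] = {}
        self.replica = replica
        self.synchronous = synchronous
        # Changes accepted by the primary but not yet applied to the replica.
        self.pending: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated before the write returns.
            self.replica.apply(key, value)
        else:
            # Asynchronous: the change is queued and shipped later.
            self.pending.append((key, value))

    def ship_pending(self) -> None:
        """Apply all queued changes to the replica (the replica catches up)."""
        while self.pending:
            key, value = self.pending.popleft()
            self.replica.apply(key, value)

    def lag(self) -> int:
        """Number of writes the replica would be missing if the primary failed now."""
        return len(self.pending)
```

In the asynchronous case, any writes still sitting in `pending` when the primary's region fails are the data the replica would be missing after a failover. In the synchronous case that queue is always empty, at the cost of waiting for the replica on every write.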

Disaster recovery on the cloud is a huge topic with many facets to consider. In the PGP-CC course, we take a deep dive into each of these aspects, and over the course of 6 months, you will be able to put them all together to form a complete picture for your customers.
