Data High Availability with Objects Native Replication

Anirudha | Tue, 12/22/2020 - 06:13

Nutanix added Objects Replicator support in early 2020, which helped in achieving high availability through out-of-band replication. Replicator also helped in migrating workloads from non-Objects to Objects clusters. While this worked just fine, it could not provide near-sync replication capability, and configuring Replicator in another VM was still an overhead. Now, with the Objects 3.0 release, Nutanix has added native replication support in Objects, which is a highly scalable and performant replication solution. It also provides near-sync-to-async replication capabilities.

In this blog we will take a look at:

  1. Objects Replicator
  2. Objects Native Replication
    1. Near-sync Replication
    2. Async Replication
  3. Data replication and Delete replication.
  4. Backup integration with Objects Replication. (Coming Soon)
    1. Failover and Failback.


Objects Replicator:

As I said above, Objects Replicator was the first solution introduced to handle out-of-band replication and also the migration of workloads from non-Objects solutions to Objects. With the latest Objects replication release, it now also supports migrating non-S3 workloads (i.e. from a legacy file system) to Objects.

We have covered this at length in: Nutanix Objects Replicator: Replication support and Migrating data from Legacy filesystem to Objects. We will try to cover migrating a non-S3 backup target to Objects in this blog or an upcoming blog.

So one of the questions you may ask is: is Objects Replicator no longer relevant now that Objects supports native replication?

The short answer is no, it is still relevant. The reason is that Objects Replicator is more than just a replication tool: you can use it as a migration solution for moving your data to Objects, or even for migrating your workload from a legacy file system to Objects S3. It will continue to evolve to support more use cases, so look out for upcoming cool features in this solution.
 

Objects Native Replication:

Even though Objects Replicator provides an easy way to do out-of-band replication, there is a lot of room for optimization. Achieving near-sync replication, or building in more intelligence to identify user load on the cluster so that replication can either speed itself up or take a step back to give precedence to business-critical applications, becomes more critical as you move more applications onto Objects.

Native replication uses system resources more effectively and efficiently, not only to give you high availability but also to achieve near-sync replication if configured properly. In case it cannot keep up with all the incoming I/Os, it gives more precedence to the user workload and falls back to background replication. It also gives you high performance and can scale to replicating thousands of buckets and billions of objects.

While having high availability of your data is extremely important, fine-tuning your deployment and infrastructure is equally critical to use these features to their full potential. Even if Objects supports near-sync replication, there are many factors which can impact this, such as a slow network or a high incoming request rate.

  • Near-sync Replication:

As soon as data is written to the source cluster, it is immediately replicated to the destination cluster. This helps in achieving higher data availability with a reliable failover and failback mechanism. E.g. if you have a Commvault backup running and for some reason your source Objects cluster goes down due to a disaster, then Commvault can switch to the destination Objects cluster and continue to restore from the latest backup and push new backups. With the help of near-sync replication, data is replicated almost instantly, so all the data pushed by the backup until the time of the disaster is replicated and Commvault can restore user data. Near-sync ensures faster data availability on both sites, so failover or failback of your application becomes easy.

  • Async Replication:

If, for various reasons, your Objects cluster is not able to keep up with the replication rate, it falls back to async replication. You are still guaranteed that all the data written on the source cluster is replicated to the destination cluster, but it may not happen instantaneously. Data is first written to the source cluster and then, as part of a background process, it gets replicated to the destination cluster. In this case, if a disaster happens at the source site, it is possible that not all the data has been replicated to the destination cluster, specifically backups which were in progress or had just finished, and your application may not find the recently written data. Data integrity is always maintained, and once the source cluster comes back online, replication resumes. Completion of data replication depends on various factors such as the network, the platform and the incoming IO rate.
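To make the difference between the two modes a bit more concrete, here is a minimal Python sketch of a replication backlog. It is a toy model and not how Objects is implemented: the names `simulate_backlog` and `mode`, the per-second rates, and the rule that "zero pending data means near-sync" are all my assumptions for illustration.

```python
def simulate_backlog(incoming_mb_per_s, replication_mb_per_s):
    """Return the un-replicated backlog (MB) after each one-second step.

    Hypothetical toy model: each second, new data lands on the source and
    the replicator ships at most replication_mb_per_s of whatever is pending.
    """
    backlog = 0.0
    history = []
    for incoming in incoming_mb_per_s:
        backlog += incoming
        backlog -= min(backlog, replication_mb_per_s)
        history.append(backlog)
    return history


def mode(backlog_mb):
    # Assumption for this sketch: no pending data means near-sync behaviour.
    return "near-sync" if backlog_mb == 0 else "async"


# A burst of writes in the middle exceeds what the replicator can ship.
writes = [50, 50, 200, 200, 50, 50, 50]
history = simulate_backlog(writes, replication_mb_per_s=100)
for second, pending in enumerate(history, start=1):
    print(second, pending, mode(pending))
```

In this model, whatever is still pending at the moment the source site fails is exactly the data that has not reached the destination yet.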

 

Let's take a look at the factors which can impact replication speed:

  • Network:

You need to make sure you have good network connectivity between the source and destination clusters, so that any data written to the source is instantly replicated to the destination cluster. Given that your data has to move over the network, a slower link or multiple hops in the underlying network can limit near-sync replication capability.

E.g. if you have configured replication between two clusters which are deployed on the same network within the same datacenter, chances are they will not suffer from any network bottleneck (unless you have too many hosts or heavy traffic on the network). In this case, replication can be much faster. If instead you have configured replication between two clusters across different geographical locations, the data has to travel much farther and go through multiple network hops, or even over the internet/WAN/VPN. The higher latency will impact near-sync replication, and Objects may fall back to async replication mode.
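As a rough back-of-the-envelope illustration of why distance matters, the sketch below estimates the time to copy a single object as one round trip plus the time to push the bytes through the link. The function name `transfer_seconds` and the link numbers are hypothetical, and the model ignores protocol overhead, TCP ramp-up and parallel streams.

```python
def transfer_seconds(object_mb, bandwidth_mbps, rtt_ms):
    """Rough time to copy one object: one round trip plus serialization time.

    object_mb is in megabytes, bandwidth_mbps in megabits per second.
    """
    megabits = object_mb * 8
    return rtt_ms / 1000 + megabits / bandwidth_mbps


# Hypothetical links, chosen only to illustrate the gap.
same_dc = transfer_seconds(object_mb=100, bandwidth_mbps=10_000, rtt_ms=0.5)
wan = transfer_seconds(object_mb=100, bandwidth_mbps=200, rtt_ms=80)
print(f"same datacenter: {same_dc:.3f} s, WAN: {wan:.3f} s")
```

Even in this simplified model the WAN copy is far slower, so a steady stream of writes is much more likely to pile up behind it.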

I would recommend taking a look at the networking blogs here and here, which I wrote a few months back.

  • Dedicated deployment vs brownfield deployment:

You can deploy Objects on a dedicated cluster or on your existing cluster. Given that Objects runs as a service on your AOS cluster, you can practically run Objects along with all your other workloads. While this works just fine, you have to remember that you are sharing AOS resources, both compute and disk, with all the other workloads. E.g. if you have some noisy VMs running on your AOS cluster which take away most of the compute and storage bandwidth, this will impact Objects and its capability to push more data during replication.

  • Choosing the wrong hardware platform:

You can deploy your Objects cluster on any hardware as long as it is supported by the underlying AOS cluster. So I could practically go with an NX1065, which has just 4 physical disks, or an NX8155, which has 10 physical disks. Objects can efficiently use all the disks in the underlying AOS cluster and can achieve higher throughput if you have more disks in the cluster. Fewer disks would result in lower throughput. So choosing the right platform for Objects deployment and replication configuration is equally critical, and a lot depends on this decision.

Do refer to my old blog on the platform here; it will give you some more details.

  • Higher incoming requests:

Let's say you have taken care of the network and hardware platform very well. If the incoming IO rate on the source Objects cluster is very high, Objects will still give more priority to ongoing IOs than to replication. This means replication will take a step back and happen as part of background tasks, which may take longer depending on the load on the cluster.
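Here is a small Python sketch of that trade-off. It assumes a simple rule where user I/O is always served first and replication only gets the leftover capacity; the function name `seconds_to_drain` and all the numbers are hypothetical and are not taken from how Objects actually schedules work.

```python
def seconds_to_drain(backlog_mb, capacity_mb_s, user_io_mb_s):
    """Count seconds until the backlog is fully replicated.

    user_io_mb_s is a per-second timeline of user load. User I/O is served
    first (assumption) and replication uses whatever capacity is left.
    Returns None if the backlog is not drained within the timeline.
    """
    for second, user_io in enumerate(user_io_mb_s, start=1):
        budget = max(0, capacity_mb_s - user_io)
        backlog_mb -= min(backlog_mb, budget)
        if backlog_mb == 0:
            return second
    return None


busy = seconds_to_drain(600, capacity_mb_s=1000, user_io_mb_s=[900] * 10)
quiet = seconds_to_drain(600, capacity_mb_s=1000, user_io_mb_s=[400] * 10)
print("busy cluster:", busy, "s; quiet cluster:", quiet, "s")
```

The same backlog takes several times longer to replicate when user traffic is eating most of the cluster's capacity.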

  • Uneven clusters on source and destination:

Now even if you take care of all the points above, you may still see issues if you have chosen uneven clusters for replication. E.g. I have a 12N Objects source cluster running on DX380, which gives me high throughput to support ongoing IOs and also achieves near-sync replication. But my destination cluster is only 3N, running on the same platform. So in this case my source cluster is much more powerful than the destination, and given the small size of the destination cluster, it may never be able to keep up with the replication requests (if they come at full speed). If this happens, replication will again take a step back and fall back to async replication. It will happen at the speed of your destination cluster.
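One way to think about this is that replication can only go as fast as the slowest stage in the path. The sketch below picks that stage for a 12-node source and a 3-node destination. The per-node throughput, the network figure and the function name `replication_bottleneck` are hypothetical values I made up for illustration, not measured numbers for any platform.

```python
def replication_bottleneck(rates_mb_s):
    """Return (stage, rate) for the slowest stage in the replication path."""
    stage = min(rates_mb_s, key=rates_mb_s.get)
    return stage, rates_mb_s[stage]


PER_NODE_MB_S = 500  # hypothetical per-node throughput, not a measured figure

path = {
    "source (12 nodes)": 12 * PER_NODE_MB_S,
    "network": 3000,
    "destination (3 nodes)": 3 * PER_NODE_MB_S,
}
stage, rate = replication_bottleneck(path)
print(f"replication is limited to {rate} MB/s by: {stage}")
```

With these assumed numbers the destination is the limit, so adding more source nodes or a faster link would not speed up replication.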

 

While you deploy your cluster and set up replication for high availability, make sure you understand these limitations in more detail and make the right decisions. I would highly recommend that you go through the official Objects documentation and talk to your Nutanix representative, who can help you make these decisions.

 

Next Read - Configuring Objects Replication.