How to Decide Between Infrastructure-Level Versus Application-Level Data Replication

Replicating data for an application across multiple nodes provides improved scalability, availability, and disaster recovery capabilities compared to using localized storage in a single location.

That said, not all applications require data replication. It’s most often necessary for commercial software that will run in customer environments across different geographical regions, software that handles large amounts of data, and software that requires high availability (HA) and fault tolerance. For applications that don’t need to run in multiple regions, can tolerate some downtime or performance degradation, and don’t need to scale horizontally, centralized storage (which is cheaper and easier to manage) is likely sufficient.

When considering how to handle data replication, the two main options are to use an infrastructure-level distributed storage provider or to replicate data from the application level.

What are the differences between infrastructure-level and application-level data replication?

Distributed data storage often happens at the infrastructure level, where the underlying system architecture manages data storage across multiple nodes running in different geographical regions. In this case, the application interacts with the distributed storage system to read and write data. Common examples of infrastructure-level distributed storage are Rook Ceph and cloud storage services like Amazon S3.

With application-level data replication, the application itself handles data replication and retrieval through custom logic according to policies specific to the application. Common examples of application-level data replication include etcd, Postgres, and MongoDB, which can all replicate their data to peer nodes.

Should I use infrastructure-level distributed storage or application-level data replication for my application?

For ISVs delivering an application to self-hosted customers, there are several factors to consider when deciding whether to use distributed storage at the infrastructure level or to handle data replication from the application. As with most aspects of distributing modern self-hosted commercial software, this decision depends on the unique requirements of the application, the customers, and the ISV. This blog touches on a few of the factors that are often most important to consider.

What level of performance do you need?

It isn’t always easy to know your application’s performance needs today, and it’s harder still to predict what they might be in the future. Because of this, ISVs commonly default to infrastructure-level distributed storage while underestimating or misunderstanding how that choice could lead to significant performance issues down the road.

Application-level data replication will almost always provide better performance because it gives ISVs greater control over their data management strategies. For example, with application-level control of data replication, ISVs can decide:

Which data needs to be replicated and which data is not worth the extra overhead and cost of replication
How frequently different types of data are replicated
When synchronous versus asynchronous data replication is used
The types of data that are in cold storage versus those that are cached for quick access
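
The decisions in the list above can be sketched as a per-data-class policy table. This is a minimal illustration, not a real replication engine: the data class names, the settings, and the `acks_required` helper are all hypothetical, and waiting for every peer on a synchronous write is a simplifying assumption (many systems wait for a quorum instead).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    SYNC = "sync"    # write is confirmed only after peers acknowledge it
    ASYNC = "async"  # write is confirmed locally; peers catch up later


@dataclass(frozen=True)
class ReplicationPolicy:
    replicate: bool                # is this data worth replicating at all?
    mode: Optional[Mode] = None    # sync or async, when replicated
    interval_seconds: float = 0.0  # how often async batches are shipped
    cached: bool = False           # kept hot for quick access vs. cold storage


# Hypothetical data classes and settings, chosen only for illustration.
POLICIES = {
    "billing_events": ReplicationPolicy(True, Mode.SYNC, cached=True),
    "metrics": ReplicationPolicy(True, Mode.ASYNC, interval_seconds=30.0),
    "thumbnail_cache": ReplicationPolicy(False),
}


def acks_required(data_class: str, peer_count: int) -> int:
    """Peer acknowledgements to wait for before confirming a write.

    Assumption: synchronous writes wait for every peer; asynchronous and
    unreplicated writes are confirmed without waiting for any peer.
    """
    policy = POLICIES[data_class]
    if policy.replicate and policy.mode is Mode.SYNC:
        return peer_count
    return 0
```

With a table like this, the write path can pay the latency cost of synchronous replication only for the data that justifies it, which is the kind of control an infrastructure-level provider usually does not expose.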

Conversely, with distributed storage at the infrastructure level, these types of policies depend on the storage provider, giving the ISV fewer opportunities for performance optimization. While many cloud providers offer high-IOPS hardware options, their cost is not trivial compared to local storage. If the application doesn’t actually need replication, that extra cost can be a significant barrier to adoption. Moreover, if your application is installed in a bare metal customer environment where the disk is already slower than you would like, and that disk is used by an infrastructure-level distributed storage system like Rook Ceph, throughput will be even lower than what the disk would deliver natively.

For workloads where performance is not a top concern, such as when read and write operations are performed only occasionally or when data doesn’t need to be accessed quickly, the ease of handling distributed storage at the infrastructure level could make it the better option. If, however, the application depends on workloads where performance is critical, such as workloads that process many messages or write very frequently (as with time-series data), managing data replication at the application level will likely be preferable.

Will your app need to run in bare metal environments?

It is also important to consider the types of customer environments where the application will be running. Particularly for applications that will be installed on bare metal (physical servers) in customer environments, application-level data replication will usually be the best option. This is because, for many enterprises that need to install on bare metal, getting access to the dedicated hard drives and the network interfaces required for an infrastructure-level distributed storage system is challenging (and sometimes impossible).

For example, in bare metal environments, Rook Ceph is the primary option for handling distributed storage at the infrastructure level. Rook requires a dedicated, unformatted block device attached to each node that can’t be used for any other purpose. Additionally, getting the best possible performance requires 10 Gigabit networking (and even then it won’t match the performance of local storage where replication is handled at the application level). These hardware requirements are often too expensive or otherwise difficult for many enterprises to provision.

If, however, the customer is not actually installing on bare metal and will instead run the application on a virtualization platform like VirtualBox or vSphere, infrastructure-level distributed storage is more feasible. For example, on a virtualization platform, the ISV can accommodate customers with particularly high utilization by simply setting up their storage class to request higher performance drives. With a solution like Rook Ceph in a truly bare metal environment, accommodating the same utilization would instead mean asking the customer to fully replace their hard drives.

So, when deploying on bare metal, handling storage replication at the application level relieves the customer of the burden of providing the hard drives necessary to support an infrastructure-level distributed storage system. In this way, handling replication at the application level can give the ISV more confidence in their application’s ability to run in any environment, including bare metal.

How do you want to handle failover?

Most providers of infrastructure-level distributed storage have built-in failover solutions that will switch seamlessly to a backup system when the primary fails, without the need for manual intervention. For example, when using Rook Ceph, the cluster will typically still be running even when it’s in an unhealthy state, giving the customer or ISV support team more time to troubleshoot as compared to a situation where a failure would leave the workload down and unusable.

Infrastructure-level distributed storage providers are also typically extremely reliable and well-tested, especially when it comes to avoiding data loss. Amazon EBS, for example, claims to have never lost data in an EBS volume for the life of the service. Overall, an ISV is much less likely to lose data when using an infrastructure-level provider compared to managing their own Postgres. This combination of ease-of-use and reliability offered by infrastructure-level distributed storage systems can represent a significant benefit for many ISVs.

With application-level data replication, failover can be more challenging because of the additional time and effort required to build a custom solution. Building your own failover solution might also come with less confidence in the ability of the system to handle failures appropriately without any loss of data. For workloads where data loss would be catastrophic, ISVs might be more confident using the failover and disaster recovery capabilities offered by the infrastructure-level distributed storage providers rather than relying on a bespoke solution.

One benefit of handling failover at the application level is the ability to design a high availability (HA) solution with zero downtime when there’s a failure. For example, Postgres can be run in a configuration where the loss of a node is completely transparent to users.
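
One piece of making a node loss invisible to users is a client that moves on to another replica when its first choice is unreachable. The sketch below shows only that client-side step, under stated assumptions: the `first_healthy` helper, the `AllNodesDown` exception, and the node names are hypothetical, and the `query` callable stands in for whatever database driver the application actually uses. Replication and promotion of a new primary are separate concerns that this sketch does not cover.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllNodesDown(Exception):
    """Raised when no node in the list could serve the request."""


def first_healthy(nodes: Sequence[str], query: Callable[[str], T]) -> T:
    """Run `query` against each node in order and return the first success.

    A node counts as failed when `query` raises ConnectionError; any other
    exception is treated as a real error and propagates to the caller.
    """
    errors: List[str] = []
    for node in nodes:
        try:
            return query(node)
        except ConnectionError as exc:
            errors.append(f"{node}: {exc}")
    raise AllNodesDown("; ".join(errors))
```

A caller passes its replicas in priority order, for example `first_healthy(["db-a", "db-b"], run_query)`, and only sees an error when every listed node is unreachable.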

In contrast to designing an application-level solution, using distributed storage for HA typically requires at least about five minutes of downtime. This is because, when a node goes offline, Kubernetes by default waits about five minutes before it treats the node as truly down and reschedules the node’s pods. On top of that, there will be additional downtime while the replacement pod starts and becomes ready to serve data. In this case, ISVs should consider whether they (and their customers) can accept five or more minutes of downtime when a node goes down in exchange for the added reliability and confidence against data loss that comes with an infrastructure-level system like Rook Ceph or AWS EBS.
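
The downtime arithmetic above can be written out as a quick budget check. This is a rough sketch: the 300-second wait reflects the five-minute default described in the text, while the 45-second pod startup time and the downtime budgets are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverEstimate:
    reschedule_wait_seconds: float  # wait before pods are rescheduled
    pod_startup_seconds: float      # time for the new pod to become ready

    def total_seconds(self) -> float:
        """Total expected downtime after a node failure."""
        return self.reschedule_wait_seconds + self.pod_startup_seconds

    def within_budget(self, budget_seconds: float) -> bool:
        """True if the expected downtime fits the allowed budget."""
        return self.total_seconds() <= budget_seconds


# 300 s matches the five-minute default; 45 s is an assumed startup time.
estimate = FailoverEstimate(reschedule_wait_seconds=300.0, pod_startup_seconds=45.0)
print(estimate.total_seconds())      # 345.0
print(estimate.within_budget(60.0))  # False
```

Plugging in a customer’s real downtime tolerance makes it clear quickly whether the default Kubernetes behavior is acceptable or whether an application-level HA design is worth the extra engineering.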

Summary

ISVs deciding between infrastructure-level distributed storage or handling data replication at the application level should consider several factors, including the performance needs of their application workloads, the types of customer environments where the software will be running, and their failover and HA requirements.

The table below highlights the key factors discussed in this blog that ISVs should consider when deciding between infrastructure-level distributed storage or application-level data replication:
