🏰Continuity & Resilience

How resilient is the Archive?

Overview

The Archive is designed and maintained to meet the highest standards of data integrity, availability, and resilience.

The Archive is delivered as a high-availability, self-healing, horizontally scaling cluster.

At a high level, the solution comprises a highly redundant Kubernetes cluster made up of many independent, redundantly provisioned components. Because each component is redundant, the failure of any single component does not cause a service outage: that component can be taken offline and repaired without affecting the others.

The Archive leverages the AWS S3, EC2, and RDS services, which carry the following SLAs:

https://aws.amazon.com/s3/sla/

https://aws.amazon.com/rds/sla/

https://aws.amazon.com/compute/sla/

Each of these services carries an uptime commitment of at least 99.9%, and actual historical uptime is considerably higher, as evidenced here:

https://aws.amazon.com/premiumsupport/technology/pes/
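To put the 99.9% figure in concrete terms, the short sketch below converts an uptime percentage into the downtime it permits over a period. The function name and the 30-day month are illustrative assumptions, not part of any AWS SLA definition; the SLAs linked above define their own measurement periods and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the maximum downtime, in minutes, that an uptime percentage permits.

    period_days defaults to 30 as a simplifying assumption for "one month".
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


# 99.9% over a 30-day month leaves roughly 43 minutes of permitted downtime.
print(round(allowed_downtime_minutes(99.9), 1))
# The same percentage over a 365-day year leaves under 9 hours.
print(round(allowed_downtime_minutes(99.9, period_days=365) / 60, 2))
```

Higher percentages shrink this budget quickly, which is why the historical uptime figures matter as much as the contractual minimum.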

All system databases and key components are backed up and retained independently.


Backup Procedures and Policies

Database Backups:

AWS RDS is configured for automated daily backups, along with continuous transaction log backups, enabling point-in-time recovery. Backup retention policies are set according to data requirements, with a minimum retention of 30 days.

Backups are encrypted at rest using AWS KMS-managed keys (AES-256) to ensure data security and compliance with industry standards.
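As an illustration of what the retention window means for point-in-time recovery, the sketch below checks whether a requested restore time falls inside a 30-day window. The function and constant names are hypothetical, and the check is a simplification: it treats the current time as the latest restorable point, whereas in practice the latest restorable time lags slightly behind.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 30  # minimum retention stated above; an assumption for this sketch


def is_restorable(target: datetime, now: datetime,
                  retention_days: int = RETENTION_DAYS) -> bool:
    """Return True if `target` lies inside the point-in-time recovery window."""
    earliest = now - timedelta(days=retention_days)
    # Simplification: `now` stands in for the latest restorable time.
    return earliest <= target <= now


now = datetime(2024, 6, 30, 12, 0, tzinfo=timezone.utc)
print(is_restorable(now - timedelta(days=10), now))  # inside the window
print(is_restorable(now - timedelta(days=31), now))  # older than retention
```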

Application Data Backup:

Data is also stored in Amazon S3 and is encrypted. AWS S3 carries industry-leading durability commitments against data loss: https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html

Backup Validation:

Backups undergo daily integrity checks to ensure completeness and consistency, aligning with data protection best practices.
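One common way to check completeness and consistency is to compare a backup file's size and checksum against values recorded when the backup was taken. The sketch below shows that general technique only; it is not a description of the Archive's actual validation tooling, and the function names and the choice of SHA-256 are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str, expected_size: int) -> bool:
    """Check completeness (file size) and consistency (checksum) of a backup file."""
    if not path.is_file():
        return False
    if path.stat().st_size != expected_size:
        return False  # truncated or padded file
    return sha256_of(path) == expected_sha256
```

Checking the size first is a cheap way to reject truncated files before paying for a full checksum pass.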


Restore Processes and Assurance

Database Restoration:

AWS RDS supports automated restoration from backups to any specific point within the retention period, tested and validated during regular DR exercises.

Application Data Recovery:

Ceph offers high availability. Mounted, distributed AWS EBS volumes are routinely snapshotted and provide secure, rapid recovery.

Application Recovery:

Kubernetes manifests, Helm charts, and Infrastructure-as-Code (IaC) scripts enable rapid, reliable re-deployment of the solution.

Testing Frequency:

Restore procedures are tested quarterly as part of our formal disaster recovery (DR) drills to ensure operational readiness.

Entire environments are destroyed and recreated from backup in different AWS regions and availability zones as part of our operational recovery testing.


Resilience and High Availability Measures

Database Resilience:

AWS RDS is deployed with a multi-AZ configuration to provide automatic failover and near-zero downtime in the event of infrastructure issues.

Application Data Resilience:

Application data is stored on AWS S3 in secure, non-public buckets. AWS S3 is designed to keep data safe and accessible, even in the event of component or regional failures. Its 11 nines (99.999999999%) of durability reflect the robustness of its design for preventing data loss.

https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html
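The 11 nines figure can be turned into an expected loss rate with simple arithmetic. The sketch below uses exact fractions to avoid floating-point rounding at this scale. The function names are illustrative, and the model assumes each object is lost independently at the stated annual rate, which is a simplification of a design target rather than a guarantee.

```python
from fractions import Fraction

DURABILITY_NINES = 11
# Annual loss probability per object: 1 - 0.99999999999 = 1 / 10**11
ANNUAL_LOSS_RATE = Fraction(1, 10 ** DURABILITY_NINES)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year at the stated durability."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses (object_count > 0)."""
    return 1 / expected_annual_losses(object_count)


# Storing ten million objects implies, on average, one lost object every 10,000 years.
print(years_per_lost_object(10_000_000))
```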

Kubernetes Cluster Resilience:

Our Kubernetes environment leverages self-healing capabilities: nodes are redundant, and pods are automatically restarted in the event of failure. Pod disruptions are managed through monitoring and automated rolling updates. Replica strategies are configured to maintain service continuity.

Disaster Recovery (DR):

RPO and RTO times vary for different services and solution elements, but every component is individually redundant locally and across AWS availability zones.

Assuming a full service outage, the RPO for Archive and usage data would be under 5 minutes. Journalling might be suspended during a major outage, but would resume and backfill once the service recovered, ensuring gapless archiving.

The entire system can be deployed to a new availability zone within 2 hours and can be redeployed to any AWS region from Infrastructure as Code in under 5 hours.
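The recovery figures above can be expressed as explicit targets and checked against the measurements from a DR drill. The sketch below is a hypothetical illustration: the scenario names, the `DrillResult` type, and the boundary handling (strictly under the RPO, at or under the RTO) are assumptions rather than the Archive's actual tooling.

```python
from dataclasses import dataclass
from datetime import timedelta

# Targets taken from the figures stated above.
RPO_TARGET = timedelta(minutes=5)
RTO_TARGETS = {
    "new_availability_zone": timedelta(hours=2),
    "new_region": timedelta(hours=5),
}


@dataclass(frozen=True)
class DrillResult:
    scenario: str                # key into RTO_TARGETS (hypothetical names)
    data_loss_window: timedelta  # measured gap between last good data and outage
    recovery_time: timedelta     # measured time to restore service


def meets_targets(result: DrillResult) -> bool:
    """Return True if a drill stayed within both the RPO and the scenario's RTO."""
    rto = RTO_TARGETS[result.scenario]
    return result.data_loss_window < RPO_TARGET and result.recovery_time <= rto


drill = DrillResult("new_region", timedelta(minutes=3), timedelta(hours=4, minutes=30))
print(meets_targets(drill))
```

Recording drill outcomes in this form makes it straightforward to track whether measured recovery times drift towards the stated limits over successive exercises.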

Around one third of total system infrastructure would have to be offline for a full service outage to occur, which would likely indicate a serious AWS outage. Any continuity plan looking to account for this should therefore consider the wider ramifications for AWS and dependent services.


Compliance Testing and Monitoring

Yearly Independent Audits:

Backup, restore, and resilience mechanisms are subject to yearly independent audits and ongoing monitoring to verify compliance with ISO/IEC 27001 and SOC 2 criteria (Availability and Confidentiality).

Continuous Monitoring:

Real-time monitoring and alerting ensure rapid detection and response to potential service disruptions or backup anomalies.

Incident Response:

A documented incident response plan, with defined escalation paths, ensures any issues affecting backup or resilience are addressed in a compliant and timely manner.
