The RosettaHealth Disaster Recovery Policy establishes procedures to recover RosettaHealth following a disruption resulting from a disaster. This Disaster Recovery Policy is maintained by the RosettaHealth Security Officer and CTO.
The following objectives have been established for this plan:
- Maximize the effectiveness of contingency operations through an established plan that consists of the following phases:
  - Notification/Activation phase to detect and assess damage and to activate the plan;
  - Recovery phase to restore temporary IT operations and recover damage done to the original system;
  - Reconstitution phase to restore IT system processing capabilities to normal operations.
- Identify the activities, resources, and procedures needed to carry out RosettaHealth processing requirements during prolonged interruptions to normal operations.
- Identify and define the impact of interruptions to RosettaHealth systems.
- Assign responsibilities to designated personnel and provide guidance for recovering RosettaHealth during prolonged periods of interruption to normal operations.
- Ensure coordination with other RosettaHealth staff who will participate in the contingency planning strategies.
- Ensure coordination with external points of contact and vendors who will participate in the contingency planning strategies.
This RosettaHealth Disaster Recovery Policy has been developed as required under the Office of Management and Budget (OMB) Circular A-130, Management of Federal Information Resources, Appendix III, November 2000, and the Health Insurance Portability and Accountability Act (HIPAA) Final Security Rule, Section §164.308(a)(7), which requires the establishment and implementation of procedures for responding to events that damage systems containing electronic protected health information.
Examples of the types of disasters that would initiate this plan include natural disasters, political disturbances, man-made disasters, external human threats, and internal malicious activities.
RosettaHealth defines two categories of systems from a disaster recovery perspective:
- Critical Systems. These systems host application servers and database servers or are required for the functioning of systems that host application servers and database servers. These systems, if unavailable, affect the integrity of data and must be restored, or have a process begun to restore them, immediately upon becoming unavailable.
- Non-critical Systems. These are all systems not considered critical by the definition above. These systems, while they may affect some secondary capabilities of the platform, do not prevent Critical Systems from functioning and being accessed appropriately. These systems are restored at a lower priority than Critical Systems.
Applicable Standards
Applicable Standards from the HITRUST Common Security Framework
- 12.c - Developing and Implementing Continuity Plans Including Information Security
Applicable Standards from the HIPAA Security Rule
- 164.308(a)(7)(i) - Contingency Plan
Line of Succession
The following order of succession ensures that decision-making authority for the RosettaHealth Contingency Plan is uninterrupted. The Chief Technology Officer (CTO) is responsible for ensuring the safety of personnel and the execution of procedures documented within this RosettaHealth Contingency Plan. If the CTO is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the CEO shall function as that authority. Should the contingency plan need to be initiated, use the contact list below.
- Kevin Puscas, CTO: (301) 919-2978, kevin.puscas@rosettahealth.com
- Buff Colchagoff, CEO: (202) 345-0298, buff.colchagoff@rosettahealth.com
Responsibilities
The RosettaHealth Tech Team is responsible for coordinating with ClearDATA in the recovery of the RosettaHealth production environment in AWS, including AWS services, network services, and all EC2 servers.
The RosettaHealth Tech Team is directly responsible for ensuring that all RosettaHealth Platform components are working. It is also responsible for testing redeployments and assessing damage to the environment.
Testing and Maintenance
The CTO shall establish criteria for validation/testing of a Contingency Plan, an annual test schedule, and ensure implementation of the test. This process will also serve as training for personnel involved in the plan’s execution. At a minimum the Contingency Plan shall be tested annually (within 365 days).
Disaster Recovery Scenarios
The RosettaHealth Platform is built to be highly resilient, with components distributed across multiple data centers operated by AWS and Rackspace. This provides a level of high availability and resiliency to the production environment. However, there remains the possibility, however remote, that the loss of a data center or a platform-wide issue could cause the loss of critical platform capability.
Disaster Recovery Procedures
Notification and Activation Phase
This phase addresses the initial actions taken to detect and assess damage inflicted by a disruption to RosettaHealth. Based on the assessment of the event, the Recovery Phase may be activated by either the CTO or the CEO.
The notification sequence is listed below:
- The first responder is to notify the CTO/CEO. All known information must be relayed to the CTO/CEO.
- The CTO/CEO is to notify team members and direct them to complete the assessment procedures to determine the extent of the service interruption and estimated recovery time.
- The CTO/CEO determines that the event is adversely impacting multiple customers, either directly or indirectly, and begins the Recovery Phase.
Recovery Phase
This section provides procedures for recovering RosettaHealth Platform operations. The goal is to restore the RosettaHealth Platform to an acceptable production state.
The tasks outlined below define the Action Plans for each scenario.
Scenario 1 - Platform-wide issue:
- Contact Partners and Customers affected.
- Compile a list of impacted platform components (a scripted inventory sketch follows this list). These should include:
  - EC2 Instances
    - spring-boot components
    - HISP components (James, Tomcat, Dovecot)
    - Mirth interface instances
    - nginx and HAProxy web servers
    - PHP portal applications
    - StrongSwan VPN appliances
    - SFTP services
    - iptables
    - EC2 OS
  - AWS Services
    - Aurora MySQL database
    - Athena database reporting engine
    - Lambda functions
    - Step Functions
    - S3 storage
    - SFTP services
    - MongoDB
    - AWS networking (VPC, ALB, NLB, API Gateway, Security Rules)
- Determine if roll-back will address the impacted components (see the roll-back table below).
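The impacted-component inventory above can be compiled by hand, but a short script against the AWS APIs speeds this up. The following is a minimal sketch, assuming boto3 credentials for the production account and a hypothetical `Role` tag on each EC2 instance; the region and tag names are placeholders and should be adjusted to the actual environment.

```python
"""Sketch: compile a list of impacted EC2-hosted platform components.

Assumes boto3 credentials for the production account and a hypothetical
"Role" tag on each instance (e.g. "hisp", "mirth", "portal").
"""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption


def list_instances_by_state():
    """Group EC2 instances by state so stopped or impaired hosts stand out."""
    impacted, healthy = [], []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
                record = {
                    "InstanceId": inst["InstanceId"],
                    "Role": tags.get("Role", "unknown"),  # hypothetical tag
                    "State": inst["State"]["Name"],
                }
                (healthy if record["State"] == "running" else impacted).append(record)
    return impacted, healthy


if __name__ == "__main__":
    impacted, healthy = list_instances_by_state()
    print(f"{len(healthy)} running, {len(impacted)} not running")
    for rec in impacted:
        print(rec)
```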
Component | Roll-back action
---|---
spring-boot components | Change the service to the last known good working version (multiple past versions should still be on the EC2 instance).
HISP components (James, Tomcat, Dovecot, DNS) | Roll back the EC2 instance to the last known working version.
Mirth interface instances | Roll back the EC2 instance to the last known working version. Note: this may also require re-installing the last known working versions of the spring-boot apps that run on those EC2 servers.
nginx and HAProxy web servers | Roll back the EC2 instance to the last known working version.
PHP portal applications | Roll back the EC2 instance to the last known working version.
StrongSwan VPN appliances | Roll back the EC2 instance to the last known working version.
SFTP services | Roll back the EC2 instance to the last known working version.
EC2 OS | Roll back the EC2 instance to the last known working version of the root EBS volume.
Special Considerations:
- Roll-back of EC2 instances will require working with the ClearDATA team to coordinate the roll-back and restoration (an illustrative roll-back sketch follows this list).
- Depending on the instance, roll-back may require dismounting the /data volume and remounting the latest /data volume.
- For each rolled-back component, confirm by watching the live logs that the component is working as expected.
- Visually verify logging, security, monitoring, and alerting functionality for all components.
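The table above describes roll-backs at the component level. As an illustration only, the sketch below shows how a roll-back of an EC2 root EBS volume to a prior snapshot could be scripted with boto3; in practice this is coordinated with ClearDATA, and the instance ID, snapshot ID, device name, and region are hypothetical placeholders.

```python
"""Sketch: roll back an EC2 instance's root EBS volume to a prior snapshot.

Illustration only -- actual roll-backs are coordinated with ClearDATA.
"""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption


def rollback_root_volume(instance_id, snapshot_id, device="/dev/xvda"):
    # Stop the instance before swapping the root volume.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Find and detach the current root volume.
    inst = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    old_volume = next(
        m["Ebs"]["VolumeId"]
        for m in inst["BlockDeviceMappings"]
        if m["DeviceName"] == device
    )
    az = inst["Placement"]["AvailabilityZone"]
    ec2.detach_volume(VolumeId=old_volume)
    ec2.get_waiter("volume_available").wait(VolumeIds=[old_volume])

    # Create a replacement volume from the last known-good snapshot and attach it.
    new_volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=az)["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume])
    ec2.attach_volume(VolumeId=new_volume, InstanceId=instance_id, Device=device)
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume])

    # Restart; any /data volume remount is then handled on the host itself.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```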
Scenario 2 - AWS Data Center/Services interruption:
- Contact Partners and Customers affected.
- Compile a list of impacted platform components. These should include AWS Services:
  - Aurora MySQL database
  - Athena database reporting engine
  - Lambda functions
  - Step Functions
  - S3 storage
  - SFTP services
  - MongoDB
  - AWS networking (VPC, ALB, NLB, API Gateway, Security Rules)
- Determine dependent platform components and functions that are impacted.
- For AWS Services, coordinate with ClearDATA for the status of issues from AWS. Recovery will depend on this evaluation.
- Once new AWS services have been established, verify that the RosettaHealth Platform components are running on the appropriate EC2 instances.
- Test logging, security, monitoring, and alerting functionality (a verification sketch follows this list).
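Once services are re-established, the verification steps above can be partially automated. The sketch below is a minimal example, assuming boto3 access; the region and the component health-check URLs are hypothetical placeholders and should be replaced with the platform's real endpoints.

```python
"""Sketch: post-recovery verification that EC2 hosts and components respond."""
import boto3
import urllib.request

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

HEALTH_CHECKS = {  # hypothetical endpoints
    "spring-boot components": "https://internal.example/actuator/health",
    "php portal applications": "https://portal.example/health",
}


def instances_with_passing_status_checks():
    """Return instance IDs whose EC2 system and instance status checks pass."""
    ok = []
    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    for st in resp["InstanceStatuses"]:
        if (st["SystemStatus"]["Status"] == "ok"
                and st["InstanceStatus"]["Status"] == "ok"):
            ok.append(st["InstanceId"])
    return ok


def check_component_endpoints():
    """Hit each component health endpoint and report its HTTP status."""
    results = {}
    for name, url in HEALTH_CHECKS.items():
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results[name] = resp.status
        except Exception as exc:  # record failures instead of raising
            results[name] = f"FAILED: {exc}"
    return results


if __name__ == "__main__":
    print("Instances passing status checks:", instances_with_passing_status_checks())
    for component, status in check_component_endpoints().items():
        print(f"{component}: {status}")
```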
Post-Recovery Phase
This section discusses activities to be performed after recovery of platform capabilities has been confirmed by the RosettaHealth technical team.
- Contact customers impacted.
- Perform traffic impact analysis.
  - Determine what customer traffic may have been impacted (an illustrative reporting query follows this list).
- Document a root-cause analysis.
  - Document the timeline of the event, the cause, the actions taken, and any short-term mitigations.
- Determine any mitigation/preventive actions that can be taken to prevent future events or make the platform more resilient.
- Update the Platform Risk Assessment as needed.
- Perform an external vulnerability scan.
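For the traffic impact analysis, one option is to query the Athena reporting engine for message volumes during the outage window. The sketch below is illustrative only; the database, table, column names, result bucket, region, and time window are hypothetical placeholders.

```python
"""Sketch: traffic impact analysis for an outage window via Athena."""
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region is an assumption

QUERY = """
SELECT customer_id, COUNT(*) AS messages
FROM traffic_log                      -- hypothetical table
WHERE received_at BETWEEN timestamp '2024-01-01 00:00:00'
                      AND timestamp '2024-01-01 06:00:00'
GROUP BY customer_id
ORDER BY messages DESC
"""


def run_query(database="reporting", output="s3://example-athena-results/"):
    """Run the traffic query and return the result rows."""
    qid = athena.start_query_execution(
        QueryString=QUERY,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output},
    )["QueryExecutionId"]

    # Poll until the query finishes, then pull the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")
    return athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]


if __name__ == "__main__":
    for row in run_query():
        print([col.get("VarCharValue") for col in row["Data"]])
```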